Author

Theresa Szczepanski

Research Question

I work for a public charter school in Massachusetts. Our students in Grades 5-10 are tested annually in Mathematics, English Language Arts, and Science. I am interested in using the student performance data to identify areas of weakness in our program in terms of curricular alignment. For every question on a given assessment, the state releases an item description detailing what was asked of students, along with the average score earned by students at our school and the average score earned by students across Massachusetts.

My research question is: “Can test item descriptions on MCAS assessments be mined to extract features associated with student performance trends?”

Hypothesis

I have already found statistically significant patterns in student performance associated with an item’s content reporting category in every subject and grade level at our school.

In Mathematics, I have identified specific content reporting categories that are relative weaknesses at different grade levels; however, this does not take into account differences in question types like those Urrutia and Araya classified among open-ended Math prompts. I would be curious to see whether our students are weaker on items that ask them to evaluate an expression (apply technical skills in algebra/numeracy) vs. construct or interpret a mathematical model (conceptual understanding). This would be very interesting information for teachers.

When I interviewed one of our experienced teachers, who has a history of success with student achievement in English Language Arts (ELA), she identified specific skills she believes all kids need to practice for success in ELA, namely, “synthesizing multiple texts” and “practice at the differences in reading informational text vs. literature”. These requirements can be mined from item descriptions but not from an item’s reporting category or standard description. Our students have historically performed weaker on the 7th grade ELA exam than on the 6th and 8th grade exams, which suggests a curricular alignment issue. I’ve already identified reporting categories in which our students performed relatively weaker on this assessment, and I suspect that within these categories there are patterns in the types of questions or tasks our students struggle with. This could provide valuable information for teachers adjusting their instruction and instructional materials.

Hypotheses:

H1: A predictive model of student performance on Grade 5-10 Mathematics MCAS assessment items that includes regressors taken from the test item descriptions will outperform a baseline predictive model that includes only a given test item’s content reporting category.

H2: A predictive model of student performance on Grade 5-10 English Language Arts MCAS assessment items that includes regressors taken from the test item descriptions will outperform a baseline predictive model that includes only a given test item’s content reporting category.

H3: A predictive model of student performance on Grade 5-8 Science MCAS assessment items that includes regressors taken from the test item descriptions will outperform a baseline predictive model that includes only a given test item’s content reporting category.

Data Sources

I scraped the Department of Elementary and Secondary Education’s accountability page. It includes tables for all Next Generation MCAS tests (beginning in 2018) and text descriptions for all test items. I had already done this for the High School Science exams in my Python course last fall, and the structures of the tables are similar for the other exams.

Here is my folder of Colab notebooks with my Python code for scraping the different grade levels and subjects, and here is a link to the notebook I used for the High School Science MCAS.

  • Math Corpus: 1010 Documents (item descriptions)

  • STE Corpus: 510 Documents, MS Only: 398 Documents

  • ELA Corpus: 693 Documents

In their paper, Urrutia and Araya used classification to categorize open-ended mathematics questions into different types. I would like to use dictionaries to classify MCAS questions into different categories using the item descriptions.

Then I would like to use supervised machine learning methods that include the features created from the dictionaries, as well as features that I extracted from the MCAS reporting tables, to predict student performance relative to peers across the state (school_state_diff) on a given item.

I would like to use this same approach for the English Language Arts and Science exams as well.
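As a sketch of the dictionary-based tagging step, quanteda's `dictionary()` and `tokens_lookup()` can tag item descriptions with categories. The item texts, category names, and key phrases below are illustrative placeholders, not the final dictionary, which will come from the n-gram analysis:

```r
library(quanteda)

# Toy item descriptions standing in for the scraped MCAS text (illustrative only)
toy_corpus <- corpus(c(
  "Create a linear model to represent a real-world situation.",
  "Evaluate a numerical expression that involves two operations."
))

# Hypothetical key phrases; the real dictionary will come from the n-gram analysis
math_dict <- dictionary(list(
  modeling = c("model*", "represent*"),
  patterns_algebra = c("expression*", "equation*", "evaluate*")
))

# Tag each item description with the dictionary categories it matches
toy_toks <- tokens_tolower(tokens(toy_corpus, remove_punct = TRUE))
tag_dfm <- dfm(tokens_lookup(toy_toks, dictionary = math_dict))
tag_dfm
```

The resulting category counts per document can then be joined back onto the item data frame as model features.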

Load Libraries

Code
library(caret)
Loading required package: ggplot2
Loading required package: lattice
Code
library(devtools)
Loading required package: usethis
Code
library(e1071)
library(ggplot2)
library(ggpubr)
library(hrbrthemes)
library(Metrics)

Attaching package: 'Metrics'
The following objects are masked from 'package:caret':

    precision, recall
Code
library(plotly)

Attaching package: 'plotly'
The following object is masked from 'package:ggplot2':

    last_plot
The following object is masked from 'package:stats':

    filter
The following object is masked from 'package:graphics':

    layout
Code
library(plyr)

Attaching package: 'plyr'
The following objects are masked from 'package:plotly':

    arrange, mutate, rename, summarise
The following object is masked from 'package:ggpubr':

    mutate
Code
library(purrr)

Attaching package: 'purrr'
The following object is masked from 'package:plyr':

    compact
The following object is masked from 'package:caret':

    lift
Code
library(quanteda)
Package version: 4.1.0
Unicode version: 14.0
ICU version: 71.1
Parallel computing: disabled
See https://quanteda.io for tutorials and examples.
Code
library(quanteda.textstats)
library("quanteda.textplots")
library(randomForest)
randomForest 4.7-1.2
Type rfNews() to see new features/changes/bug fixes.

Attaching package: 'randomForest'
The following object is masked from 'package:ggplot2':

    margin
Code
library(RColorBrewer)
library(tidytext)
library(tidyverse)
── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ stringr   1.5.1
✔ forcats   1.0.0     ✔ tibble    3.2.1
✔ lubridate 1.9.3     ✔ tidyr     1.3.1
✔ readr     2.1.5     
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::arrange()       masks plyr::arrange(), plotly::arrange()
✖ dplyr::combine()       masks randomForest::combine()
✖ purrr::compact()       masks plyr::compact()
✖ dplyr::count()         masks plyr::count()
✖ dplyr::desc()          masks plyr::desc()
✖ dplyr::failwith()      masks plyr::failwith()
✖ dplyr::filter()        masks plotly::filter(), stats::filter()
✖ dplyr::id()            masks plyr::id()
✖ dplyr::lag()           masks stats::lag()
✖ purrr::lift()          masks caret::lift()
✖ randomForest::margin() masks ggplot2::margin()
✖ dplyr::mutate()        masks plyr::mutate(), plotly::mutate(), ggpubr::mutate()
✖ dplyr::rename()        masks plyr::rename(), plotly::rename()
✖ dplyr::summarise()     masks plyr::summarise(), plotly::summarise()
✖ dplyr::summarize()     masks plyr::summarize()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
Code
library(tree)
library(viridis)
Loading required package: viridisLite

Read-in/Tidy Data

Most of the cleaning was done in the process of scraping the tables; that code is included in my folder of Python Colab notebooks.

STE Data Frames

Code
STE_DF<-read_csv("STE_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `discipline core idea`, `standard`, `standard desc`, `pts`, `school%`, `state%`, school_state_diff, grade_level, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = as.factor(grade_level))
Rows: 521 Columns: 15
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): type, reporting category, standard, item description, discipline co...
dbl (7): year, number, pts, school%, state%, school_state_diff, grade_level

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Code
STE_DF

ELA Data Frames

Code
ELA_G5_DF<-read_csv("G5_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 5)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 171 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(9): number, item_library_URL, standard_map_URL, type, reporting categor... dbl
(6): ...1, year, pts, school%, state%, school_state_diff lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#ELA_G5_DF


ELA_G6_DF<-read_csv("G6_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 6)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 171 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(9): number, item_library_URL, standard_map_URL, type, reporting categor... dbl
(6): ...1, year, pts, school%, state%, school_state_diff lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#ELA_G6_DF

ELA_G7_DF<-read_csv("G7_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 7)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 172 Columns: 24
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(9): number, item_library_URL, standard_map_URL, type, reporting categor... dbl
(7): ...1, year, pts, school%, state%, school_state_diff, Unnamed: 10 lgl (8):
released, Unnamed: 3, Unnamed: 4, Unnamed: 5, Unnamed: 6, Unnamed: ...
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#ELA_G7_DF

ELA_G8_DF<-read_csv("G8_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 8)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 171 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(9): number, item_library_URL, standard_map_URL, type, reporting categor... dbl
(6): ...1, year, pts, school%, state%, school_state_diff lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#ELA_G8_DF


ELA_G10_DF<-read_csv("G10_ELA_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `Cluster`, `text_type`, `standard`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 10)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 173 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(9): number, item_library_URL, standard_map_URL, type, reporting categor... dbl
(6): ...1, year, pts, school%, state%, school_state_diff lgl (1): released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#ELA_G10_DF

ELA_DF<-rbind(ELA_G5_DF, ELA_G6_DF, ELA_G7_DF, ELA_G8_DF, ELA_G10_DF)

ELA_DF<-ELA_DF%>%
  filter(`type` == "SR")

ELA_DF

Math Data Frames

Code
Math_G5_DF<-read_csv("G5_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 5)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 200 Columns: 19
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(11): item_library_URL, standard_map_URL, type, reporting category, stan... dbl
(7): ...1, year, number, pts, school%, state%, school_state_diff lgl (1):
released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#Math_G5_DF


Math_G6_DF<-read_csv("G6_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 6)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 200 Columns: 17
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(9): item_library_URL, standard_map_URL, type, reporting category, stand... dbl
(7): ...1, year, number, pts, school%, state%, school_state_diff lgl (1):
released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#Math_G6_DF

Math_G7_DF<-read_csv("G7_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 7)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 200 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(8): item_library_URL, standard_map_URL, type, reporting category, stand... dbl
(7): ...1, year, number, pts, school%, state%, school_state_diff lgl (1):
released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#Math_G7_DF

Math_G8_DF<-read_csv("G8_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 8)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 200 Columns: 16
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(8): item_library_URL, standard_map_URL, type, reporting category, stand... dbl
(7): ...1, year, number, pts, school%, state%, school_state_diff lgl (1):
released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#Math_G8_DF


Math_G10_DF<-read_csv("G10_Math_all_years_2024.csv")%>%
  select(`item description`, year, number, type, `reporting category`, `standard`, `Standard Description`,  `pts`, `school%`, `state%`, school_state_diff, item_library_URL)%>%
  mutate(number = as.factor(number))%>%
  mutate(grade_level = 10)%>%
  mutate(grade_level = as.factor(grade_level))
New names:
Rows: 210 Columns: 17
── Column specification
──────────────────────────────────────────────────────── Delimiter: "," chr
(9): item_library_URL, standard_map_URL, type, reporting category, stand... dbl
(7): ...1, year, number, pts, school%, state%, school_state_diff lgl (1):
released
ℹ Use `spec()` to retrieve the full column specification for this data. ℹ
Specify the column types or set `show_col_types = FALSE` to quiet this message.
• `` -> `...1`
Code
#Math_G10_DF

Math_DF<-rbind(Math_G5_DF, Math_G6_DF, Math_G7_DF, Math_G8_DF, Math_G10_DF)

Math_DF

Methods/Pre-Processing Approach

For each exam I followed a similar approach:

  1. Create a dataframe of all of the item descriptions and features across all grade levels and subjects. I kept as annotations all of the features provided by the state, which include: year, standard, reporting category, cluster (sub-category), pts, state%, school%, school_state_diff, and type (open response or selected response).

  2. Using preText, I looked for recommendations via the lowest score after running factorial_preprocessing and selected tokens accordingly. Interestingly, the recommendation for the Math and Science corpora was to not remove stop words. I believe this is because of the key logical role words like “and”, “but”, “not”, and “or” play. These stop words do dominate the item descriptions, though, so I also created a second set of tokens with stop words removed, which I used for text visualizations and descriptive analysis of the text. When applying dictionaries and using machine learning, I will use the texts that include stop words. Also, whenever stemming was recommended, I instead used lemmatization.

  3. Create subset document-feature matrices: one for the items that students performed best on and one for the items that students performed worst on.

  4. Create visuals of the word connections for all item descriptions, as well as for the item descriptions in the best and worst subsets, to explore differences.

  5. Using n-grams, identify key phrases from the item descriptions.

  6. Create a dictionary that uses key phrases identified in steps 4 and 5 to tag an item as including any of the following

    Math:

    • modeling: A modeling problem

    • apply_division: Problem requiring the use of division and division facts

    • patterns_algebra: Problem requiring the manipulation of equivalent algebraic or numeric expressions

    ELA:

    • poetry: Questions where the source text is a poem

    • figurative_language: Questions requiring the analysis of figurative language

    Science:

    • analyze_graph: Questions involving the analysis of a graph

    • analyze_model: Questions involving the analysis of a model

  7. Update the Subject_DF to include the new features identified via the subject-area dictionary for each document; these serve as features for a supervised machine learning model that predicts school_state_diff.

  8. Create a Random Forest model using the features from the MCAS item reports and tune its parameters using 5-fold cross-validation.

  9. Create a Random Forest model using the features from the MCAS item reports plus the additional features extracted using the dictionaries and document-feature matrices.

  10. Compare the performance of the models as measured by % variance explained and MAE.
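Steps 8-10 can be sketched with caret, which handles the 5-fold tuning via `trainControl()`. The data frame, feature names (`reporting_category`, `modeling`, `patterns_algebra`), and sizes below are toy placeholders for the real MCAS features, not the actual analysis:

```r
library(caret)
library(randomForest)

set.seed(1)

# Toy stand-in for the item-level data frame (illustrative only)
n <- 200
toy <- data.frame(
  reporting_category = factor(sample(c("Algebra", "Geometry", "Number"), n, replace = TRUE)),
  modeling = rbinom(n, 1, 0.3),          # hypothetical dictionary flag
  patterns_algebra = rbinom(n, 1, 0.4),  # hypothetical dictionary flag
  school_state_diff = rnorm(n)
)

# 5-fold cross-validation for tuning and performance estimates
ctrl <- trainControl(method = "cv", number = 5)

# Baseline: reporting category only
rf_base <- train(school_state_diff ~ reporting_category,
                 data = toy, method = "rf", trControl = ctrl)

# Augmented: reporting category plus dictionary features
rf_full <- train(school_state_diff ~ reporting_category + modeling + patterns_algebra,
                 data = toy, method = "rf", trControl = ctrl)

# Compare cross-validated MAE for the two models
getTrainPerf(rf_base)$TrainMAE
getTrainPerf(rf_full)$TrainMAE
```

With real features the same comparison would also pull % variance explained from the fitted `randomForest` objects.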

Math Exam

Create Corpus

Code
Math_item_corpus <-corpus(Math_DF, text_field = "item description")
Warning: NA is replaced by empty string
Code
#print(Math_item)

#(summary(Math_item_corpus))

Math_DF

preText Pre-Processing Decisions

Before completing pre-processing, I examined the different choices using preText. I was surprised to see that removing stop words had a positive correlation coefficient.

I can see how key logical words like “not”, “and”, and “or”, which are also stop words, can have a significant impact on the meaning of an exam question. Perhaps because each individual text is so small, and the texts are designed to assess content skills rather than to tell a story, the stop words play more significant roles?

Given these results, I will pre-process the data in two ways: once following the combination with the lowest preText score, N, plus lowercasing (lowercasing has a correlation coefficient of 0, and I can’t see how it affects anything), and once with stop words also removed, to use for text visualization and simple descriptive summaries of the text:

  • Recommended preText approach: “N-L” (remove numbers and lowercase)

  • Alternative approach: “P-N-L-W” (remove punctuation, remove numbers, lowercase, and remove stop words)

Code
# Sys.unsetenv("GITHUB_PAT")
# devtools::install_github("matthewjdenny/preText")
# library(preText)
Code
# preprocessed_documents_math <- factorial_preprocessing(
#     Math_item_corpus,
#     use_ngrams = TRUE,
#     infrequent_term_threshold = 0.2,
#     verbose = FALSE)
Code
# #names(preprocessed_documents_math)
# 
# head(preprocessed_documents_math$choices)
Code
# 
# preText_results <- preText(
#     preprocessed_documents_math,
#     dataset_name = "Math MCAS Item Descriptions",
#     distance_method = "cosine",
#     num_comparisons = 20,
#     verbose = FALSE)
# 
# 
# preText_score_plot(preText_results)
Code
# regression_coefficient_plot(preText_results,
#                             remove_intercept = TRUE)

Tokenization 1: N-L

Code
## Extract the tokens

Math_item_tokens <- tokens(Math_item_corpus)

#print(Math_item_tokens)
Code
Math_item_tokens1 <- tokens(Math_item_corpus, 
    remove_numbers = T)


Math_item_tokens1 <- tokens_tolower(Math_item_tokens1)

# Math_item_tokens1 <- tokens_select(Math_item_tokens1,
#                    pattern = stopwords("en"),
#                   selection = "remove")

print(Math_item_tokens1)
Tokens consisting of 1,010 documents and 12 docvars.
text1 :
 [1] "determine"   "the"         "coordinates" "of"          "a"          
 [6] "point"       "in"          "the"         "first"       "quadrant"   
[11] "that"        "will"       
[ ... and 14 more ]

text2 :
 [1] "match"       "numerical"   "expressions" "that"        "involve"    
 [6] "two"         "operations"  "with"        "equivalent"  "word"       
[11] "expressions" "."          

text3 :
 [1] "solve"       "a"           "real-world"  "problem"     "that"       
 [6] "involves"    "multiplying" "a"           "fraction"    "by"         
[11] "a"           "mixed"      
[ ... and 2 more ]

text4 :
 [1] "determine"   "the"         "volume"      "of"          "right"      
 [6] "rectangular" "prisms"      "by"          "counting"    "unit"       
[11] "cubes"       "and"        
[ ... and 6 more ]

text5 :
[1] "round"     "decimals"  "to"        "the"       "nearest"   "hundredth"
[7] "."        

text6 :
 [1] "estimate"  "the"       "sum"       "of"        "two"       "fractions"
 [7] "that"      "are"       "both"      "less"      "than"      "one"      
[ ... and 1 more ]

[ reached max_ndoc ... 1,004 more documents ]

Tokenization 2: P-N-L-W & Lemmatization

Code
# remove punctuation and numbers
Math_item_tokens2 <- tokens(Math_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

# remove stopwords

Math_item_tokens2 <- tokens_select(Math_item_tokens2,
                   pattern = stopwords("en"),
                  selection = "remove")

# lower case

Math_item_tokens2 <- tokens_tolower(Math_item_tokens2)




print(Math_item_tokens2)
Tokens consisting of 1,010 documents and 12 docvars.
text1 :
 [1] "determine"   "coordinates" "point"       "first"       "quadrant"   
 [6] "create"      "rectangle"   "first"       "three"       "points"     
[11] "rectangle"   "given"      

text2 :
[1] "match"       "numerical"   "expressions" "involve"     "two"        
[6] "operations"  "equivalent"  "word"        "expressions"

text3 :
[1] "solve"       "real-world"  "problem"     "involves"    "multiplying"
[6] "fraction"    "mixed"       "number"     

text4 :
 [1] "determine"   "volume"      "right"       "rectangular" "prisms"     
 [6] "counting"    "unit"        "cubes"       "using"       "volume"     
[11] "solve"       "problems"   

text5 :
[1] "round"     "decimals"  "nearest"   "hundredth"

text6 :
[1] "estimate"  "sum"       "two"       "fractions" "less"      "one"      

[ reached max_ndoc ... 1,004 more documents ]

lemmatization

Code
lem_Math_item_tokens2<-tokens_replace(Math_item_tokens2,
                             pattern = lexicon:: hash_lemmas$token,
                             replacement = lexicon:: hash_lemmas$lemma)

lem_Math_item_tokens2
Tokens consisting of 1,010 documents and 12 docvars.
text1 :
 [1] "determine"  "coordinate" "point"      "first"      "quadrant"  
 [6] "create"     "rectangle"  "first"      "three"      "point"     
[11] "rectangle"  "give"      

text2 :
[1] "match"      "numerical"  "expression" "involve"    "two"       
[6] "operation"  "equivalent" "word"       "expression"

text3 :
[1] "solve"      "real-world" "problem"    "involve"    "multiply"  
[6] "fraction"   "mix"        "numb"      

text4 :
 [1] "determine"   "volume"      "right"       "rectangular" "prism"      
 [6] "count"       "unit"        "cube"        "use"         "volume"     
[11] "solve"       "problem"    

text5 :
[1] "round"   "decimal" "near"    "100"    

text6 :
[1] "estimate" "sum"      "two"      "fraction" "little"   "one"     

[ reached max_ndoc ... 1,004 more documents ]

Create DFM

Code
df_Math_toks1<-dfm(Math_item_tokens1)


df_Math_toks1_smaller<- dfm_trim(df_Math_toks1, min_docfreq = 0.08, docfreq_type = "prop")


df_Math_toks1
Document-feature matrix of: 1,010 documents, 853 features (98.41% sparse) and 12 docvars.
       features
docs    determine the coordinates of a point in first quadrant that
  text1         1   4           1  2 2     1  1     2        1    1
  text2         0   0           0  0 0     0  0     0        0    1
  text3         0   0           0  0 3     0  0     0        0    1
  text4         1   1           0  1 0     0  0     0        0    0
  text5         0   1           0  0 0     0  0     0        0    0
  text6         0   1           0  1 0     0  0     0        0    1
[ reached max_ndoc ... 1,004 more documents, reached max_nfeat ... 843 more features ]
Code
df_Math_toks1_smaller
Document-feature matrix of: 1,010 documents, 27 features (73.95% sparse) and 12 docvars.
       features
docs    determine the of a in that given . two with
  text1         1   4  2 2  1    1     1 1   0    0
  text2         0   0  0 0  0    1     0 1   1    1
  text3         0   0  0 3  0    1     0 1   0    0
  text4         1   1  1 0  0    0     0 1   0    0
  text5         0   1  0 0  0    0     0 1   0    0
  text6         0   1  1 0  0    1     0 1   1    0
[ reached max_ndoc ... 1,004 more documents, reached max_nfeat ... 17 more features ]
Code
df_Math_toks2<-dfm(lem_Math_item_tokens2)


df_Math_toks_smaller<- dfm_trim(df_Math_toks2, min_docfreq = 0.08, docfreq_type = "prop")


df_Math_toks2
Document-feature matrix of: 1,010 documents, 577 features (98.64% sparse) and 12 docvars.
       features
docs    determine coordinate point first quadrant create rectangle three give
  text1         1          1     2     2        1      1         2     1    1
  text2         0          0     0     0        0      0         0     0    0
  text3         0          0     0     0        0      0         0     0    0
  text4         1          0     0     0        0      0         0     0    0
  text5         0          0     0     0        0      0         0     0    0
  text6         0          0     0     0        0      0         0     0    0
       features
docs    match
  text1     0
  text2     1
  text3     0
  text4     0
  text5     0
  text6     0
[ reached max_ndoc ... 1,004 more documents, reached max_nfeat ... 567 more features ]
Code
df_Math_toks_smaller
Document-feature matrix of: 1,010 documents, 17 features (83.40% sparse) and 12 docvars.
       features
docs    determine give expression two solve real-world problem fraction numb
  text1         1    1          0   0     0          0       0        0    0
  text2         0    0          2   1     0          0       0        0    0
  text3         0    0          0   0     1          1       1        1    1
  text4         1    0          0   0     1          0       1        0    0
  text5         0    0          0   0     0          0       0        0    0
  text6         0    0          0   1     0          0       0        1    0
       features
docs    use
  text1   0
  text2   0
  text3   0
  text4   1
  text5   0
  text6   0
[ reached max_ndoc ... 1,004 more documents, reached max_nfeat ... 7 more features ]

n-grams

Code
Sys.unsetenv("GITHUB_PAT")
devtools::install_github("slanglab/phrasemachine/R/phrasemachine")
Using GitHub PAT from the git credential store.
Skipping install of 'phrasemachine' from a github remote, the SHA1 (41cec3b7) has not changed since last install.
  Use `force = TRUE` to force installation
Code
#   
library(phrasemachine) 
phrasemachine: Simple Phrase Extraction
Version 1.2.0 created on 2017-05-29.
copyright (c) 2016, Matthew J. Denny, Abram Handler, Brendan O'Connor.
Type help('phrasemachine') or
vignette('getting_started_with_phrasemachine') to get started.
Development website: https://github.com/slanglab/phrasemachine
Code
documents_weak_items<-as.character(corpus_subset(Math_item_corpus, school_state_diff < -2))[26:29]

#documents_weak_items

phrases<-phrasemachine(documents_weak_items, minimum_ngram_length = 2,
                         maximum_ngram_length = 4,
                         return_phrase_vectors = TRUE,
                         return_tag_sequences = TRUE)
Currently tagging document 1 of 4 
Currently tagging document 2 of 4 
Currently tagging document 3 of 4 
Currently tagging document 4 of 4 
Extracting phrases from document 1 of 4 
Extracting phrases from document 2 of 4 
Extracting phrases from document 3 of 4 
Extracting phrases from document 4 of 4 
Code
# look at some example phrases
print(phrases[[1]]$phrases[1:20])
 [1] "Place_parentheses"            "parentheses_in_an_expression"
 [3] "expression_equivalent"        NA                            
 [5] NA                             NA                            
 [7] NA                             NA                            
 [9] NA                             NA                            
[11] NA                             NA                            
[13] NA                             NA                            
[15] NA                             NA                            
[17] NA                             NA                            
[19] NA                             NA                            
Code
documents_weak_items
                                                                                                                                                                                text84 
                                                                                              "Place parentheses in an expression to make the expression equivalent to a given value." 
                                                                                                                                                                                text85 
                                      "Determine the number of cubes that are used to create a figure when the length of one edge of the cube and the volume of the figure are given." 
                                                                                                                                                                                text93 
"Graph a given ordered pair on a coordinate plane, give the ordered pair of a point on a coordinate plane, and interpret coordinate values of points in the context of the situation." 
                                                                                                                                                                                text95 
                                                                                                                                 "Identify equivalent numerical and word expressions." 

Subset: Worst performing Items

Code
Math_worst_item_corpus<- corpus_subset(Math_item_corpus, school_state_diff < 2 )

# remove punctuation and numbers
Math_worst_tokens <- tokens(Math_worst_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

Math_worst_tokens <- tokens_tolower(Math_worst_tokens)


Math_worst_tokens <-  tokens_select(Math_worst_tokens,
                  pattern = stopwords("en"),
                 selection = "remove")

lem_Math_worst_tokens <- tokens_replace(Math_worst_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_Math_worst_tokens<-dfm(lem_Math_worst_tokens)
df_Math_worst_toks_smaller<- dfm_trim(df_Math_worst_tokens, min_docfreq = 0.08, docfreq_type = "prop")

Subset: Best performing Items

Code
Math_best_item_corpus<- corpus_subset(Math_item_corpus, school_state_diff > 5 )

# remove punctuation and numbers
Math_best_tokens <- tokens(Math_best_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

Math_best_tokens <- tokens_tolower(Math_best_tokens)


Math_best_tokens <-  tokens_select(Math_best_tokens,
                  pattern = stopwords("en"),
                 selection = "remove")

lem_Math_best_tokens <- tokens_replace(Math_best_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_Math_best_tokens<-dfm(lem_Math_best_tokens)
df_Math_best_toks_smaller<- dfm_trim(df_Math_best_tokens, min_docfreq = 0.08, docfreq_type = "prop")

Descriptive Analysis

Interestingly, some words, like “determine”, “real-world”, and “expression”, are top features of the Math tokens as a whole. In the tokens taken only from the worst performing items, however, “expression” appears much more frequently than “real-world”, while the opposite holds in our best performing items and in the entire corpus. Also, the word “graph” is a top feature in the Math corpus and in the best performing items, but not in the worst performing items.
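Rather than eyeballing two topfeatures() lists, this worst-vs-best contrast can be quantified with a keyness statistic. Below is a minimal sketch on toy stand-in descriptions (hypothetical data; with the real corpora you would combine the worst and best dfms and mark the subset in a docvar). textstat_keyness lives in the quanteda.textstats package:

```r
library(quanteda)
library(quanteda.textstats)

# toy stand-ins for worst- vs. best-performing item descriptions
toy <- corpus(c(w1 = "identify the equivalent expression",
                w2 = "determine the equivalent expression",
                b1 = "solve a real-world problem with a graph",
                b2 = "graph a line to solve a real-world problem"),
              docvars = data.frame(subset = c("worst", "worst", "best", "best")))

toy_dfm <- dfm(tokens(toy, remove_punct = TRUE))

# chi-squared keyness: positive scores = overrepresented in the "worst" subset
key <- textstat_keyness(toy_dfm, target = docvars(toy_dfm, "subset") == "worst")
head(key)
```

On the real data, a feature like “expression” should surface with positive keyness for the worst subset if the pattern described above is real.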

All Items: Top Features

Code
topfeatures(df_Math_toks2, 20)
 determine       give real-world        use expression    problem      solve 
       409        330        278        225        212        206        195 
  equation       numb  represent   identify      value        two    context 
       150        146        143        120        114        113        108 
     graph     number coordinate   fraction       line    involve 
       103        100         94         94         92         77 
Code
#topfeatures(df_Math_toks1, 20)

Worst Performing Items: Top Features

Code
topfeatures(df_Math_worst_toks_smaller, 20)
 determine       give expression real-world        use       numb    problem 
       163        134        115         82         82         72         63 
     solve  represent   identify   equation        two      value equivalent 
        56         53         50         49         48         46         42 
    linear 
        36 

Best performing Items: Top Features

Code
topfeatures(df_Math_best_toks_smaller, 20)
   determine         give   real-world          use        solve      problem 
         173          133          133           94           89           89 
    equation   expression    represent      context        value         numb 
          72           70           63           56           54           51 
       graph          two         line     identify relationship   coordinate 
          51           46           44           43           43           42 
    fraction      involve 
          38           36 

Visualizations

Comparing the connections in the worst performing subset with those in the best performing subset is particularly interesting. A theme of identifying equivalent expressions appears only in the worst subset. Solving “real-world” problems (i.e., modeling) appears strongly in both the best and worst subsets, which makes me curious whether there are certain mathematical topics where our students struggle with the modeling problems.

Because of these patterns, I will create a dictionary to classify whether an item requires modeling, apply_division, or patterns_alg.

Word Cloud: All Questions

Code
#dfm_item
smaller_dfm <-dfm_trim(df_Math_toks2, min_termfreq = 10)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.02, docfreq_type = "prop")

textplot_wordcloud(smaller_dfm)

All questions

Clearly, mathematical modeling is a significant skill throughout the Math exam, as the network “solve real-world problem” and the connection “real-world context” are very strong.

The word “expression”

Code
### Word connections for all items

dim(df_Math_toks_smaller)
[1] 1010   17
Code
# compute size weight for vertices in network
size <- log(colSums(df_Math_toks_smaller))

# create plot
textplot_network(df_Math_toks_smaller, vertex_size = size/ max(size) * 3)

Worst-performing questions Network Plot

The connection “equivalent expression” and the connections to the word “expression” are much stronger in the network of students’ weakest performing items. Noticeably, the words “context” and “represent” are not present as strong connectors to the word “expression” here, though they are in the best items network. This signals to me that the algebraic manipulation of expressions is a weakness, rather than constructing an expression in a mathematical modeling context.

The network “solve real-world problem” is also strong in our weakest performing item descriptions but not as heavily emphasized as in our strongest performing items.
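The visual impression from the network plots can be checked numerically with a feature co-occurrence matrix. A minimal sketch on toy descriptions (hypothetical data; with the real data, you would pass lem_Math_worst_tokens and lem_Math_best_tokens and compare the “expression” rows):

```r
library(quanteda)

# toy descriptions echoing the patterns discussed above (hypothetical data)
toks <- tokens(c("identify the equivalent expression",
                 "represent the context with an expression",
                 "solve a real-world problem"),
               remove_punct = TRUE)

# document-level co-occurrence counts; tri = FALSE keeps the matrix symmetric
co <- fcm(toks, context = "document", tri = FALSE)

# which words co-occur with "expression", and how often?
co["expression", ]
```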

Code
### Word connections for worst items

dim(df_Math_worst_toks_smaller)
[1] 401  15
Code
# compute size weight for vertices in network
size <- log(colSums(df_Math_worst_toks_smaller))

# create plot
textplot_network(df_Math_worst_toks_smaller, vertex_size = size/ max(size) * 3)

Best-performing questions

We can see the clear thread of “solve real-world problems” is strongly present in our students’ best performing items’ descriptions.

Also, I can see the connection “determine expression” present as well as “represent” and “context”. These strike me as words connected with constructing models and identifying mathematical concepts in word problems.

Code
### Word connections for best items

dim(df_Math_best_toks_smaller)
[1] 422  20
Code
# compute size weight for vertices in network
size <- log(colSums(df_Math_best_toks_smaller))

# create plot
textplot_network(df_Math_best_toks_smaller, vertex_size = size/ max(size) * 3)

Dictionary to Create New Features

Dictionary: Key Math Terms:

Code
my_math_dict <- dictionary(list(modeling=c("real-world","context", "word_problem"),             
              patterns_alg=c("equivalent expression", "distributive", "identify equivalent", "distribute", "factor", "parentheses"),
              apply_division=c("divide","division","quotient", "quotients")))

my_math_dict
Dictionary object with 3 key entries.
- [modeling]:
  - real-world, context, word_problem
- [patterns_alg]:
  - equivalent expression, distributive, identify equivalent, distribute, factor, parentheses
- [apply_division]:
  - divide, division, quotient, quotients
Code
# patterns_alg=c("equivalent expression","radical expression", "rational expression", #"expression", "eqivalent", "distributive", "distribute", "factor")
Code
math_Toks1_mydict <- df_Math_toks1 %>%
  dfm_lookup(my_math_dict)

tail(math_Toks1_mydict, 10)
Document-feature matrix of: 10 documents, 3 features (86.67% sparse) and 12 docvars.
          features
docs       modeling patterns_alg apply_division
  text1001        0            0              0
  text1002        0            0              0
  text1003        1            0              0
  text1004        1            0              0
  text1005        0            0              0
  text1006        0            0              0
[ reached max_ndoc ... 4 more documents ]
Code
df_MathDict <- convert(math_Toks1_mydict, to = "data.frame")



df_MathDict
Code
#Math_DF<-left_join(df_MathDict, Math_DF, by = "doc_id")
Code
# mathToksDFM_mydict <- df_Math_worst_tokens %>%
#   dfm_lookup(my_math_dict)
# 
# head(mathToksDFM_mydict, 10)
# 
# summary(df_Math_toks1)
Code
Math_DF<-mutate(Math_DF, ID = as.character(row_number()))

Math_DF<-mutate(Math_DF, doc_id = str_trim(paste("text", ID, sep="")))

Math_DF<-left_join(df_MathDict, Math_DF, by = "doc_id")
Code
Math_DF<-Math_DF%>%
  mutate(modeling_factor = case_when(
    modeling == 0 ~ "no",
    modeling > 0 ~ "yes"
  ))%>%
  mutate(patterns_alg_factor = case_when(
    patterns_alg == 0 ~ "no",
    patterns_alg > 0 ~ "yes"
  ))%>%
  mutate(apply_division_factor = case_when(
    apply_division == 0 ~ "no",
    apply_division > 0 ~ "yes"
  ))

Modeling Visualization

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 10)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G10 MCAS: Student Performance Modeling by Reporting Category")

G5 Modeling

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 5)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G5 MCAS: Student Performance Modeling by Reporting Category")

Code
Math_DF%>%
  filter(grade_level == 5)%>%
  filter(modeling_factor == "yes")%>%
  filter(`reporting category` == "GE")

G6 Modeling

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 6)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6 MCAS: Student Performance Modeling by Reporting Category")

Code
#Math_DF
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(modeling_factor == "yes")%>%
  filter(`reporting category` == "RP")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)
Code
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(modeling_factor == "yes")%>%
  filter(`reporting category` == "GE")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)
Code
# Math_DF%>%
#   filter(grade_level == 6)%>%
#   filter(modeling_factor == "yes")%>%
#   filter(`reporting category` == "SP")%>%
#   select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
#   arrange(`school_state_diff`)
#   

G7 Modeling

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 7)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G7 MCAS: Student Performance Modeling by Reporting Category")

Code
Math_DF
Code
Math_DF%>%
  filter(grade_level == 7)%>%
  filter(modeling_factor == "yes")%>%
  filter(`reporting category` == "EE")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)

G8 Modeling

Code
Math_DF%>%
  #filter(modeling > 0)%>%
  filter(grade_level == 8)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`modeling_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G8 MCAS: Student Performance Modeling by Reporting Category")

Code
Math_DF%>%
  filter(grade_level == 8)%>%
  filter(modeling_factor == "no")%>%
  filter(`reporting category` == "SP")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)

Patterns_Alg Visual

The concepts of equivalent expressions, the distributive property, and factoring seem to be emphasized, and particularly challenging for our students, in grades 6, 7, and 10. They are not heavily emphasized on the G8 exam.
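The emphasis claim can be backed with a simple count of flagged items per grade level. A base-R sketch on hypothetical values (with the real data, substitute Math_DF for toy_df):

```r
# toy stand-in for Math_DF (hypothetical values; swap in the real columns
# Math_DF$grade_level and Math_DF$patterns_alg_factor to check the claim)
toy_df <- data.frame(
  grade_level         = c(6, 6, 7, 8, 10, 10),
  patterns_alg_factor = c("yes", "no", "yes", "no", "yes", "yes")
)

# items flagged patterns_alg per grade level
tab <- table(toy_df$grade_level, toy_df$patterns_alg_factor)
tab
```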

Code
Math_DF%>%
  #filter(patterns_alg > 0)%>%
  filter(grade_level == 6 | grade_level == 7 | grade_level == 10)%>%
  ggplot( aes(x=`patterns_alg_factor`, y=`school_state_diff`, fill=`patterns_alg_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6, G7, G10 MCAS: Student Performance Patterns Algebra")#+

Code
 #facet_wrap(~`reporting category`)

G6

Code
Math_DF%>%
  #filter(patterns_alg > 0)%>%
  filter(grade_level == 6)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`patterns_alg_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6 MCAS: Student Performance Patterns Algebra and Reporting Category")

Code
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(patterns_alg_factor == "yes")%>%
 # filter(`reporting category` == "SP")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)

Apply Division Visual

G6

G6 is when students are expected to be fluent in the standard algorithm for long division. Applying division seems to have been a factor in student success in some G6 reporting categories.

Code
Math_DF%>%
  #filter(patterns_alg > 0)%>%
  filter(grade_level == 6)%>%
  ggplot( aes(x=`apply_division_factor`, y=`school_state_diff`, fill=`apply_division_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6 MCAS: Student Performance Apply Division")#+

Code
 #facet_wrap(~`grade_level`)
Code
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(`reporting category` == "NS" | `reporting category` == "EE")%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`apply_division_factor`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("Math G6 MCAS: Student Performance Apply_Division and Reporting Category")

Code
Math_DF%>%
  filter(grade_level == 6)%>%
  filter(apply_division_factor == "yes")%>%
  filter(patterns_alg_factor == "no")%>%
 # filter(`reporting category` == "SP")%>%
  select(`standard`, `item description`, `school_state_diff`, `year`, item_library_URL)%>%
  arrange(`school_state_diff`)

Machine Learning Model to Predict school_state_diff

Create Full Modeling Data Set

Code
#Math_DF
Math_DF_Train_Full<-Math_DF%>%
  mutate(year = as.factor(year))%>%
  mutate(`school_state_bool` = case_when(
    `school_state_diff` <= 0 ~ "negative",
    `school_state_diff` > 0 ~ "positive")
  )%>%
  #filter(grade_level == 5)%>%
  filter(grade_level == 10 | grade_level == 8| grade_level == 7| grade_level == 6|grade_level == 5)%>%
 mutate(id = row_number())%>%
  select(`doc_id`, `id`, `standard`, `reporting category`, `type`, `year`,  `modeling_factor`, `patterns_alg_factor`, `apply_division_factor`, `state%`, `school_state_bool`, `school_state_diff`, `grade_level`)

Math_DF_Train_Full

Create Training Set

Code
# set seed
set.seed(12345)

N<-nrow(Math_DF_Train_Full)

trainIndex <- sample(1:N, .8*N)

testIndex<-c(1:N)[-trainIndex]

# check length of training set
length(trainIndex)
[1] 808
Code
trainIndexDF<-trainIndex

#Math_DF_Test<-Math_DF_Train_Full[-Math_DF_Train]


trainIndexDF<-as.data.frame(t(trainIndex))
dim(trainIndexDF)
[1]   1 808
Code
#dim(Math_DF_Test)

#Create training vector

trainIndexDF
Code
trainIndexDF<-trainIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#trainIndexDF<-mutate(trainIndexDF, doc_id = str_trim(paste("text", ID, #sep="")))%>%
#  select("doc_id")

trainIndexDF
Code
Math_DF_Train_Full
Code
Math_DF_Train<-left_join(trainIndexDF, Math_DF_Train_Full, by = "id")

Math_DF_Train

Create Test Set

Code
testIndexDF<-testIndex

#Math_DF_Test<-Math_DF_Test_Full[-Math_DF_Test]


testIndexDF<-as.data.frame(t(testIndex))
dim(testIndexDF)
[1]   1 202
Code
#dim(Math_DF_Test)

#Create testing vector

#testIndexDF

testIndexDF<-testIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#testIndexDF<-mutate(testIndexDF, doc_id = str_trim(paste("text", ID, sep="")))%>%
#  select("doc_id")

#Math_DF_Test_Full

Math_DF_Test<-left_join(testIndexDF, Math_DF_Train_Full, by = "id")
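As an aside, the transpose-and-pivot_longer steps above can be collapsed: row indexing with the sampled positions produces the same train/test split directly. A base-R sketch on toy data (with the real data, Math_DF_Train_Full[trainIndex, ] and Math_DF_Train_Full[-trainIndex, ] would do the same):

```r
set.seed(12345)

# toy stand-in for Math_DF_Train_Full
df <- data.frame(id = 1:10, y = rnorm(10))

N <- nrow(df)
trainIndex <- sample(1:N, 0.8 * N)

# row indexing gives the split directly, no transpose/pivot needed
train <- df[trainIndex, ]
test  <- df[-trainIndex, ]

c(train = nrow(train), test = nrow(test))
```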

Random Forest Model

Base Regression

Cross Validation

Random forest with text features

Code
data_ctrl<-trainControl(method = "cv", number = 5)

Math_DF_Train_Full_Text<-Math_DF_Train_Full%>%
  select(`standard`, `reporting category`, `type`, `year`, `modeling_factor`, `state%`, `grade_level`, `school_state_diff` )

Math_DF_Train_Full_Base<-Math_DF_Train_Full%>%
  select(`standard`, `reporting category`, `type`, `year`, `state%`, `grade_level`, `school_state_diff` )
Code
model_caret_rf_text<-train(school_state_diff ~.
                      , data = Math_DF_Train_Full_Text,
                      trControl = data_ctrl,
                      method = "rf")
Code
model_caret_rf_text
Random Forest 

1010 samples
   7 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 808, 808, 808, 808, 808 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE     
    2   8.729335  0.02648748  6.918423
  104   8.779073  0.05711325  6.853889
  206   8.778752  0.05913127  6.860164

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
Code
varImpPlot(model_caret_rf_text$finalModel)

Code
model_caret_rf<-train(school_state_diff ~.
                      , data = Math_DF_Train_Full_Base,
                      trControl = data_ctrl,
                      method = "rf")
Code
model_caret_rf
Random Forest 

1010 samples
   6 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 807, 808, 809, 808, 808 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE     
    2   8.745757  0.02236964  6.930960
  103   8.699142  0.06545570  6.794238
  205   8.717517  0.06691320  6.824248

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 103.
Code
Math_DF_Train_X<-Math_DF_Train%>%
  select( `standard`, `reporting category`, `type`, `year`,  `state%`, grade_level)



Math_DF_Train_Y<-Math_DF_Train%>%
  select(`school_state_bool`)

Math_DF_Train_Y
Code
Math_DF_Test_X<-Math_DF_Test%>%
  select( `standard`, `reporting category`, `type`, `year`,  `state%`, grade_level)

Math_DF_Test_Y<-Math_DF_Test%>%
  select(`school_state_diff`)


#Math_DF_Test_Y
#Math_DF_Train_Y

#Math_DF_Train_X

#Math_DF_Test_X
Code
set.seed(444)

diff_RF<-randomForest(Math_DF_Train_X,
                      y = Math_DF_Train$school_state_diff,
                      xtest = Math_DF_Test_X,
                      ytest = Math_DF_Test$school_state_diff,
                      mtry =103,
                      importance = TRUE,
                      ntree = 1010,
                      type="regression")
Warning in randomForest.default(Math_DF_Train_X, y =
Math_DF_Train$school_state_diff, : invalid mtry: reset to within valid range
Code
diff_RF

Call:
 randomForest(x = Math_DF_Train_X, y = Math_DF_Train$school_state_diff,      xtest = Math_DF_Test_X, ytest = Math_DF_Test$school_state_diff,      ntree = 1010, mtry = 103, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 1010
No. of variables tried at each split: 6

          Mean of squared residuals: 76.37658
                    % Var explained: 7.4
                       Test set MSE: 56.63
                    % Var explained: 3.09
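The “invalid mtry” warning above appears because caret’s formula interface dummy-codes the factor predictors (hence candidate mtry values like 103 and 206), while randomForest() called directly on the 6-column data frame can try at most 6 variables per split, so mtry = 103 is reset. A short sketch of the usual regression default:

```r
# randomForest's regression default: mtry = floor(p / 3), at least 1,
# where p is the number of predictor columns actually passed (6 here)
p <- 6
mtry_default <- max(floor(p / 3), 1)
mtry_default
```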

Regression w/ text

Code
Math_DF_Train_X<-Math_DF_Train%>%
  select( `standard`, `reporting category`, `type`, `year`, `state%`, `modeling_factor`, `patterns_alg_factor`, grade_level)



Math_DF_Train_Y<-Math_DF_Train%>%
  select(`school_state_diff`)

Math_DF_Train_Y
Code
Math_DF_Test_X<-Math_DF_Test%>%
  select( `standard`, `reporting category`, `type`, `year`, `state%`, `modeling_factor`, `patterns_alg_factor`, grade_level)

Math_DF_Test_Y<-Math_DF_Test%>%
  select(`school_state_diff`)
Code
set.seed(444)

diff_RF_text<-randomForest(Math_DF_Train_X,
                      y = Math_DF_Train$school_state_diff,
                      xtest = Math_DF_Test_X,
                      ytest = Math_DF_Test$school_state_diff,
                      mtry = 2,
                      importance = TRUE,
                      ntree = 1010,
                      type="regression")
Code
diff_RF_text

Call:
 randomForest(x = Math_DF_Train_X, y = Math_DF_Train$school_state_diff,      xtest = Math_DF_Test_X, ytest = Math_DF_Test$school_state_diff,      ntree = 1010, mtry = 2, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 1010
No. of variables tried at each split: 2

          Mean of squared residuals: 73.52679
                    % Var explained: 10.86
                       Test set MSE: 53.68
                    % Var explained: 8.15
Code
varImpPlot(diff_RF_text)

Support Vector Regression

To do: address the issue that standards unseen during training appear in the test set, so the SVM can’t make a prediction for those items.
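One way to handle that to-do is to predict only on test rows whose standard also appears in the training data. A base-R sketch on toy values (the standard codes here are made up; with the real data, you would filter Math_DF_Test against Math_DF_Train before calling predict()):

```r
# toy train/test frames; "8.EE.1" never appears in training
train <- data.frame(standard = c("6.NS.2", "6.EE.3", "7.EE.4"))
test  <- data.frame(standard = c("6.NS.2", "8.EE.1"))

# keep only test rows whose standard was seen during training
seen <- test$standard %in% unique(train$standard)
test_predictable <- test[seen, , drop = FALSE]

test_predictable$standard
```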

Code
set.seed(446)

Math_SVM_Train<-Math_DF_Train%>%
  select(`standard`, `reporting category`, `type`, `year`, `state%`, `modeling_factor`, `patterns_alg_factor`, `apply_division_factor`, grade_level, school_state_diff)
#Math_DF_Train_X
#Math_DF_Train$school_state_diff

svm_text<-svm(school_state_diff ~., data = Math_SVM_Train)
                     
svm_text

Call:
svm(formula = school_state_diff ~ ., data = Math_SVM_Train)


Parameters:
   SVM-Type:  eps-regression 
 SVM-Kernel:  radial 
       cost:  1 
      gamma:  0.004854369 
    epsilon:  0.1 


Number of Support Vectors:  744
Code
#svm_pred<-predict(svm_text, Math_DF_Test_X)

Science, Technology, and Engineering (STE) Exam

Create Corpus

Code
STE_item_corpus <-corpus(STE_DF, text_field = "item description")

#print(STE_item)

#summary(STE_item_corpus)

#STE_DF

preText Pre-Processing Decisions

Before completing pre-processing, I examined the different choices using preText. I was surprised to see that removing stopwords had a positive regression coefficient, yet the combination “P-W” had the lowest score of the preText results. I can see how key logical words like “not”, “and”, and “or”, which are also stopwords, can have a significant impact on the meaning of an exam question. Perhaps, because each individual text is so small, and because the texts are designed to assess content skills rather than to tell a story, the stopwords play more significant roles.

Given these results, I will pre-process the data two ways: once using the combination with the lowest recommended preText score, P-W, and once using methods that include lowercasing and lemmatization, which I will use for creating visualizations:

  • Recommended preText score: “P-W” (remove punctuation and stopwords)

  • Alternative approach: “P-N-L-W” + Lemmatization (remove punctuation, remove numbers, lower case, remove stopwords and lemmatization)

Code
# Sys.unsetenv("GITHUB_PAT")
# devtools::install_github("matthewjdenny/preText")
# library(preText)
Code
# 
# preprocessed_documents_STE <- factorial_preprocessing(
#     STE_item_corpus,
#     use_ngrams = TRUE,
#     infrequent_term_threshold = 0.2,
#     verbose = FALSE)
Code
#names(preprocessed_documents_STE)

#head(preprocessed_documents_STE$choices)
Code
# preText_results <- preText(
#     preprocessed_documents_STE,
#     dataset_name = "STE MCAS Item Descriptions",
#     distance_method = "cosine",
#     num_comparisons = 20,
#     verbose = FALSE)
# 
# 
# preText_score_plot(preText_results)
Code
# regression_coefficient_plot(preText_results,
#                             remove_intercept = TRUE)

Tokenization 1: P-W

Code
## Extract the tokens

STE_item_tokens <- tokens(STE_item_corpus)

#print(STE_item_tokens)
Code
STE_item_tokens1 <- tokens(STE_item_corpus, 
    remove_punct = T)

STE_item_tokens1 <- tokens_select(STE_item_tokens1,
                   pattern = stopwords("en"),
                  selection = "remove")

print(STE_item_tokens1)
Tokens consisting of 521 documents and 13 docvars.
text1 :
[1] "Describe"    "material"    "appropriate" "choice"      "part"       
[6] "bridge"     

text2 :
[1] "Describe" "change"   "particle" "model"    "show"     "atoms"    "thermal" 
[8] "energy"  

text3 :
[1] "Relate"    "amount"    "energy"    "seismic"   "wave"      "wave's"   
[7] "amplitude"

text4 :
 [1] "Analyze"      "model"        "bridge"       "determine"    "loads"       
 [6] "system"       "describe"     "advantage"    "disadvantage" "making"      
[11] "design"       "change"      

text5 :
 [1] "Interpret"  "model"      "two"        "oceanic"    "plates"    
 [6] "colliding"  "identify"   "geological" "feature"    "formed"    
[11] "describe"   "process"   
[ ... and 3 more ]

text6 :
 [1] "Complete"     "table"        "showing"      "percentages"  "genotypes"   
 [6] "cross"        "two"          "plants"       "heterozygous" "particular"  
[11] "trait"       

[ reached max_ndoc ... 515 more documents ]

Tokenization 2: P-N-L-W + Lemmatization

Code
# remove punctuation and numbers
STE_item_tokens2 <- tokens(STE_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

STE_item_tokens2 <- tokens_select(STE_item_tokens2,
                   pattern = stopwords("en"),
                  selection = "remove")

# lower case
STE_item_tokens2 <- tokens_tolower(STE_item_tokens2)

#print(STE_item_tokens2)

Lemmatization

When I originally made word clouds, I noticed “object” and “objects” appearing separately, as well as “model” and “models”. I believe these are important, so I chose to also apply lemmatization for creating visual summaries.

Code
lem_STE_item_tokens2 <- tokens_replace(STE_item_tokens2,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

lem_STE_item_tokens2
Tokens consisting of 521 documents and 13 docvars.
text1 :
[1] "describe"    "material"    "appropriate" "choice"      "part"       
[6] "bridge"     

text2 :
[1] "describe" "change"   "particle" "model"    "show"     "atom"     "thermal" 
[8] "energy"  

text3 :
[1] "relate"    "amount"    "energy"    "seismic"   "wave"      "wave's"   
[7] "amplitude"

text4 :
 [1] "analyze"      "model"        "bridge"       "determine"    "load"        
 [6] "system"       "describe"     "advantage"    "disadvantage" "make"        
[11] "design"       "change"      

text5 :
 [1] "interpret"  "model"      "two"        "oceanic"    "plate"     
 [6] "collide"    "identify"   "geological" "feature"    "form"      
[11] "describe"   "process"   
[ ... and 3 more ]

text6 :
 [1] "complete"     "table"        "show"         "percentage"   "genotype"    
 [6] "cross"        "two"          "plant"        "heterozygous" "particular"  
[11] "trait"       

[ reached max_ndoc ... 515 more documents ]

Create DFMs

n-grams

When I examine the phrases in the first ten weak-item documents, the weakest items seem to consistently involve the analysis of graphs, diagrams, and models. I would like to capture this as a text feature.
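One way to capture the “analyze a graph/diagram/model” requirement as a feature is a dictionary entry built from wildcard phrases, mirroring the Math dictionary above. A sketch on two toy descriptions (the phrase patterns are my guesses, not a vetted list):

```r
library(quanteda)

# hypothetical dictionary entry for visual-analysis tasks
analyze_dict <- dictionary(list(
  analyze_visual = c("analyze * model*", "analyze * graph*",
                     "analyze * diagram*", "interpret * model*")
))

toks <- tokens(c("Analyze a graph to determine a result",
                 "Determine the mass of products"),
               remove_punct = TRUE)

# one hit for the first description, none for the second
d <- dfm(tokens_lookup(toks, analyze_dict))
d
```

Because tokens_lookup matches multi-token patterns, this catches verb-plus-object phrasing that a unigram dfm_lookup would miss.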

Code
library(phrasemachine) 
documents_weak_items<-as.character(corpus_subset(STE_item_corpus, school_state_diff < -2))[1:10]

#documents_weak_items

phrases<-phrasemachine(documents_weak_items, minimum_ngram_length = 2,
                         maximum_ngram_length = 4,
                         return_phrase_vectors = TRUE,
                         return_tag_sequences = TRUE)
Currently tagging document 1 of 10 
Currently tagging document 2 of 10 
Currently tagging document 3 of 10 
Currently tagging document 4 of 10 
Currently tagging document 5 of 10 
Currently tagging document 6 of 10 
Currently tagging document 7 of 10 
Currently tagging document 8 of 10 
Currently tagging document 9 of 10 
Currently tagging document 10 of 10 
Extracting phrases from document 1 of 10 
Extracting phrases from document 2 of 10 
Extracting phrases from document 3 of 10 
Extracting phrases from document 4 of 10 
Extracting phrases from document 5 of 10 
Extracting phrases from document 6 of 10 
Extracting phrases from document 7 of 10 
Extracting phrases from document 8 of 10 
Extracting phrases from document 9 of 10 
Extracting phrases from document 10 of 10 
Code
# look at some example phrases
print(phrases[[1]]$phrases[1:20])
 [1] "Analyze_particle"        "Analyze_particle_models"
 [3] "particle_models"         "models_of_two_gas"      
 [5] "two_gas"                 "two_gas_samples"        
 [7] "gas_samples"             NA                       
 [9] NA                        NA                       
[11] NA                        NA                       
[13] NA                        NA                       
[15] NA                        NA                       
[17] NA                        NA                       
[19] NA                        NA                       
Code
documents_weak_items
                                                                                                                                     text9 
                                        "Analyze particle models of two gas samples and determine that the samples differ in temperature." 
                                                                                                                                    text12 
                                                           "Analyze force models to determine whether the speeds of objects are changing." 
                                                                                                                                    text13 
      "Describe how a change in the amount of Arctic ice in polar regions is most likely to affect a species that lives in these regions." 
                                                                                                                                    text20 
                                                                "Determine why air on Earth circulates between the equator and the poles." 
                                                                                                                                    text30 
                                        "Analyze a graph to determine a result of changes in atmospheric carbon dioxide levels over time." 
                                                                                                                                    text43 
                                                "Analyze a diagram to determine when an object has the greatest amount of kinetic energy." 
                                                                                                                                    text50 
                                                  "Interpret a plate tectonic model to determine that crust will melt as it is subducted." 
                                                                                                                                    text51 
"Interpret plate tectonic models to determine which model shows an ocean is getting larger and determine why the ocean is getting larger." 
                                                                                                                                    text60 
                                                                      "Determine that a pair of chemical models represents two compounds." 
                                                                                                                                    text66 
                                                                  "Determine the mass of products after a chemical reaction has occurred." 
Code
df_STE_toks2<-dfm(lem_STE_item_tokens2)

df_STE_toks1<-dfm(STE_item_tokens1)

Worst-performing Items

Code
STE_worst_item_corpus<- corpus_subset(STE_item_corpus, school_state_diff < 2 )

# remove punctuation and numbers
STE_worst_tokens <- tokens(STE_worst_item_corpus, 
    remove_punct = T,
    remove_numbers = T)



STE_worst_tokens <- tokens_tolower(STE_worst_tokens)

STE_worst_tokens<- tokens_select(STE_worst_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")

lem_STE_worst_tokens<-tokens_replace(STE_worst_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_STE_worst_toks<-dfm(lem_STE_worst_tokens)

df_STE_worst_toks_smaller<-dfm_trim(df_STE_worst_toks, min_docfreq = .05, docfreq_type = "prop")

MS only

Subset: Worst Items

Code
STE_MS_item_corpus<- corpus_subset(STE_item_corpus, grade_level != 9 )

#summary(STE_MS_item_corpus)


STE_MS_worst_item_corpus<-corpus_subset(STE_MS_item_corpus, school_state_diff < 2)

# remove punctuation and numbers
STE_MS_worst_tokens <- tokens(STE_MS_worst_item_corpus, 
    remove_punct = T,
    remove_numbers = T)


STE_MS_worst_tokens <- tokens_tolower(STE_MS_worst_tokens)

STE_MS_worst_tokens<- tokens_select(STE_MS_worst_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")


lem_STE_MS_worst_tokens<-tokens_replace(STE_MS_worst_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)



df_STE_MS_worst_toks<-dfm(lem_STE_MS_worst_tokens)

df_STE_MS_worst_toks_smaller<-dfm_trim(df_STE_MS_worst_toks, min_docfreq = .05, docfreq_type = "prop")

Subset: Best Items

Code
STE_MS_best_item_corpus<-corpus_subset(STE_MS_item_corpus, school_state_diff > 5)


# remove punctuation and numbers
STE_MS_best_tokens <- tokens(STE_MS_best_item_corpus, 
    remove_punct = T,
    remove_numbers = T)


STE_MS_best_tokens <- tokens_tolower(STE_MS_best_tokens)

STE_MS_best_tokens<- tokens_select(STE_MS_best_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")


lem_STE_MS_best_tokens<-tokens_replace(STE_MS_best_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)



df_STE_MS_best_toks<-dfm(lem_STE_MS_best_tokens)

df_STE_MS_best_toks_smaller<-dfm_trim(df_STE_MS_best_toks, min_docfreq = .06, docfreq_type = "prop")

#textplot_wordcloud(df_STE_MS_best_toks)
#textplot_wordcloud(df_STE_MS_best_toks_smaller)

Descriptive Analysis

When I examine the top features, it is notable that the word graph appears as a top feature of the worst items but not of the best-performing items. The words analyze and model appear among both the worst- and best-performing top features; however, words associated with the physical sciences, like force, appear in the worst items, while words associated with the life sciences, like organism, plant, and population, appear only among the best-performing top features.

All Items: Top Features

Code
topfeatures(df_STE_toks2, 20)
determine  describe    object    change   explain       use     model  identify 
      222       133       125        98        97        95        91        85 
   energy   analyze     force       two    system      show interpret different 
       77        77        75        65        60        54        53        48 
     wave     graph      time   compare 
       45        45        44        43 
Code
#topfeatures(df_Math_toks1, 20)

Worst-performing Items: Top Features

Code
STE_worst_item_corpus<- corpus_subset(STE_item_corpus, school_state_diff < 0 )

# remove punctuation and numbers
STE_worst_tokens <- tokens(STE_worst_item_corpus, 
    remove_punct = T,
    remove_numbers = T)



STE_worst_tokens <- tokens_tolower(STE_worst_tokens)

STE_worst_tokens<- tokens_select(STE_worst_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")

lem_STE_worst_tokens<-tokens_replace(STE_worst_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_STE_worst_toks<-dfm(lem_STE_worst_tokens)

df_STE_worst_toks_smaller<-dfm_trim(df_STE_worst_toks, min_docfreq = .06, docfreq_type = "prop")
Code
topfeatures(df_STE_worst_toks_smaller, 20)
determine    object    change  describe    energy     model     force   analyze 
       73        51        39        37        34        26        26        24 
calculate   explain interpret       two       use  identify   diagram     graph 
       24        23        22        20        19        18        18        16 
  compare      time    design     datum 
       15        14        14        14 

Best-performing Items: Top Features

Code
STE_best_item_corpus<- corpus_subset(STE_item_corpus, school_state_diff > 2 )

# remove punctuation and numbers
STE_best_tokens <- tokens(STE_best_item_corpus, 
    remove_punct = T,
    remove_numbers = T)



STE_best_tokens <- tokens_tolower(STE_best_tokens)

STE_best_tokens<- tokens_select(STE_best_tokens, 
                     pattern = stopwords("en"),
                     selection = "remove")


lem_STE_best_tokens<-tokens_replace(STE_best_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_STE_best_toks<-dfm(lem_STE_best_tokens)

df_STE_best_toks_smaller<-dfm_trim(df_STE_best_toks, min_docfreq = .08, docfreq_type = "prop")
Code
topfeatures(df_STE_best_toks_smaller, 20)
determine  describe       use   explain    object  identify     model    system 
      123        76        71        64        54        51        48        47 
   change   analyze     force      show       two    energy different 
       46        43        42        35        35        33        31 
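
To make the worst/best contrast explicit, the feature names can be compared directly; topfeatures() returns a named numeric vector, so base R set operations are enough. A minimal sketch, with toy vectors standing in for the real topfeatures() output:

```r
# toy stand-ins for topfeatures(df_STE_worst_toks_smaller, 20) and
# topfeatures(df_STE_best_toks_smaller, 20); the real output is a named numeric vector
worst <- c(determine = 73, graph = 16, force = 26, analyze = 24)
best  <- c(determine = 123, analyze = 43, organism = 26, plant = 20)

# features that appear on only one side of the comparison
only_worst <- setdiff(names(worst), names(best))
only_best  <- setdiff(names(best), names(worst))

only_worst  # "graph" "force"
only_best   # "organism" "plant"
```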

MS Worst: Top Features

Code
topfeatures(df_STE_MS_worst_toks_smaller, 20)
determine    object  describe    change     model    energy   analyze     force 
       96        68        54        44        41        39        34        32 
interpret       two calculate   explain  identify     graph      time       use 
       30        30        30        29        28        25        24        22 
  diagram   compare     datum different 
       19        19        18        17 

MS Best: Top Features

Code
topfeatures(df_STE_MS_best_toks_smaller, 20)
   determine          use     describe      explain      analyze        model 
          84           58           48           46           34           34 
    identify       system    different     organism       change         show 
          34           32           26           26           24           22 
         two       object        plant   population         part       energy 
          22           22           20           20           18           18 
relationship          can 
          18           18 

Visualizations

It is evident that the worst-item tokens largely represent the text of items in the physical sciences. Our 9th grade students take a science exam that covers only Physics, and, as shown below, they have performed weaker against their peers in the state than Rising Tide students at other grade levels. It would therefore be worth exploring the worst items for Middle School (5th-8th grade) students separately. When I look at Middle School student performance by reporting category, one of their weakest categories appears to be Physical Sciences, which also contributes to the dominance of physical science terms in the worst-item tokens.

Code
#STE_DF

STE_DF%>%
  filter(grade_level != 9)%>%
  group_by(`reporting category`)%>%
  summarize(mean_diff = mean(school_state_diff))
Code
STE_DF%>%
  #filter(grade_level != 9)%>%
  #filter(Course == "STEM"| Course == "Math" & `Reporting Category` == "Geometry")%>%
  #filter(`Cluster` != "Convert like measurement units within a given measurement system.")%>%
  #filter(`Cluster` != "Geometric measurement: Understand concepts of volume and relate volume to multiplication and to addition.")%>%
  #filter(`year` == 2024 | year == 2019)%>%
  ggplot( aes(x=`grade_level`, y=`school_state_diff`, fill=`grade_level`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("STE MCAS: Student Performance by Grade Level") +
    xlab("")#+

Code
STE_DF%>%
  filter(grade_level != 9)%>%
  #filter(`year` == 2024 | year == 2019)%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`reporting category`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("MS STE MCAS: Student Performance by Reporting Category") +
    xlab("")#+

Code
STE_DF%>%
  filter(grade_level != 9)%>%
  #filter(Course == "STEM"| Course == "Math" & `Reporting Category` == "Geometry")%>%
  filter(`reporting category` == "PS")%>%
  #filter(`Cluster` != "Geometric measurement: Understand concepts of volume and relate volume to multiplication and to addition.")%>%
  #filter(`year` == 2024 | year == 2019)%>%
  ggplot( aes(x=`discipline core idea`, y=`school_state_diff`, fill=`discipline core idea`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
     # legend.position="none",
      axis.text.x=element_blank(),
      plot.title = element_text(size=11)
    ) +
    ggtitle("MS STE MCAS: Student Performance in Physical Sciences") +
    xlab("")#+

Word Cloud

The word cloud renders physical science words larger than words from the other disciplines. This also suggests a need to explore the Middle School separately, since the 5th-8th grade exams give equal weight to Physical Sciences, Life Science, Earth Science, and Technology and Engineering.

Code
#dfm_item


textplot_wordcloud(df_STE_toks2)

Code
smaller_dfm <-dfm_trim(df_STE_toks2, min_termfreq = 10)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.07, docfreq_type = "prop")

textplot_wordcloud(smaller_dfm, min_count =3, random_order = FALSE)

Examine text visuals on worst-performing questions

Code
textplot_wordcloud(df_STE_worst_toks)

Code
textplot_wordcloud(df_STE_worst_toks_smaller)

Examine text visuals on best-performing questions

Code
textplot_wordcloud(df_STE_best_toks)

Code
textplot_wordcloud(df_STE_best_toks_smaller)

Word Connections

Worst items

Code
dim(df_STE_worst_toks_smaller)
[1] 148  31
Code
# compute size weight for vertices in network
size <- log(colSums(df_STE_worst_toks_smaller))

# create plot
textplot_network(df_STE_worst_toks_smaller, vertex_size = size/ max(size) * 3)

Best Items

Code
# check the dimensions (i.e., the number of rows and the number of columns) of the matrix we created

dim(df_STE_best_toks_smaller)
[1] 301  15
Code
# compute size weight for vertices in network
size <- log(colSums(df_STE_best_toks_smaller))

# create plot
textplot_network(df_STE_best_toks_smaller, vertex_size = size/ max(size) * 3)

Word connections for all items

Code
# check the dimensions (i.e., the number of rows and the number of columns) of the matrix we created
smaller_fcm <- fcm(smaller_dfm)
dim(smaller_fcm)
Code
# compute size weight for vertices in network
size <- log(colSums(smaller_fcm))

# create plot
textplot_network(smaller_dfm, vertex_size = size/ max(size) * 3)

MS Worst

Code
# check the dimensions (i.e., the number of rows and the number of columns) of the matrix we created
dim(df_STE_MS_worst_toks_smaller)
Code
# compute size weight for vertices in network
size <- log(colSums(df_STE_MS_worst_toks_smaller))

# create plot
textplot_network(df_STE_MS_worst_toks_smaller, vertex_size = size/ max(size) * 3)

MS Best

Code
#df_STE_MS_best_toks_smaller

dim(df_STE_MS_best_toks_smaller)
[1] 206  26
Code
# compute size weight for vertices in network
size <- log(colSums(df_STE_MS_best_toks_smaller))

# create plot
textplot_network(df_STE_MS_best_toks_smaller, vertex_size = size/ max(size) * 3)

Dictionary to Create New Features

Dictionary: Key STE Terms:

Code
# note: the multi-word entries (e.g., "Analyze force model") cannot match a unigram dfm;
# applying tokens_lookup() to the tokens object would be needed to capture those phrases
my_ste_dict <- dictionary(list(model=c("Analyze force model", "Analyze particle models", "Analyze a graph", "Analyze a diagram", "Interpret a graph", "model", "models", "diagram", "graph", "graphs"),             
              algebra=c("calculate")))

my_ste_dict
Dictionary object with 2 key entries.
- [model]:
  - analyze force model, analyze particle models, analyze a graph, analyze a diagram, interpret a graph, model, models, diagram, graph, graphs
- [algebra]:
  - calculate
Code
# patterns_alg=c("equivalent expression","radical expression", "rational expression", #"expression", "eqivalent", "distributive", "distribute", "factor")
Code
STE_Toks1_mydict <- df_STE_toks1 %>%
  dfm_lookup(my_ste_dict)

tail(STE_Toks1_mydict, 10)
Document-feature matrix of: 10 documents, 2 features (75.00% sparse) and 13 docvars.
         features
docs      model algebra
  text512     0       0
  text513     0       0
  text514     0       0
  text515     0       0
  text516     1       1
  text517     0       0
[ reached max_ndoc ... 4 more documents ]
Code
df_STEDict <- convert(STE_Toks1_mydict, to = "data.frame")



df_STEDict
Code
#Math_DF<-left_join(df_MathDict, Math_DF, by = "doc_id")
Code
# mathToksDFM_mydict <- df_Math_worst_tokens %>%
#   dfm_lookup(my_math_dict)
# 
# head(mathToksDFM_mydict, 10)
# 
# summary(df_Math_toks1)
Code
STE_DF<-mutate(STE_DF, ID = as.character(row_number()))

STE_DF<-mutate(STE_DF, doc_id = str_trim(paste("text", ID, sep="")))

STE_DF<-left_join(df_STEDict, STE_DF, by = "doc_id")

STE_DF
Code
STE_DF<-STE_DF%>%
  mutate(algebra_factor = case_when(
    algebra == 0 ~ "no",
    algebra > 0 ~ "yes"
  ))%>%
  mutate(model_factor = case_when(
    model == 0 ~ "no",
    model > 0 ~ "yes"
  ))

Machine Learning Training Set

Code
STE_DF
Code
STE_DF_Train_Full<-STE_DF%>%
   mutate(year = as.factor(year))%>%
#   mutate(`school_state_bool` = case_when(
#     `school_state_diff` <= 0 ~ "negative",
#     `school_state_diff` > 0 ~ "positive")
#   )%>%
#  filter(grade_level == 10)%>%
#   filter(grade_level == 10 | grade_level == 8| grade_level == 7| grade_level == 6|grade_level == 5)%>%
  mutate(id = row_number())%>%
   select(`id`, `standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `school_state_diff`, `grade_level`, model_factor, algebra_factor)
# 
 STE_DF_Train_Full

Create Training Set

Code
# set seed
set.seed(12345)

N<-nrow(STE_DF_Train_Full)

trainIndex <- sample(1:N, .8*N)

testIndex<-c(1:N)[-trainIndex]

# check length of training set
length(trainIndex)
[1] 416
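
As a side note, because trainIndex is simply a vector of row numbers, the training and test sets could also be built by direct row indexing, avoiding the transpose-and-pivot steps used below; a minimal base R sketch, with a toy data frame standing in for STE_DF_Train_Full:

```r
set.seed(12345)

df <- data.frame(id = 1:10, y = rnorm(10))  # toy stand-in for STE_DF_Train_Full

N <- nrow(df)
trainIndex <- sample(1:N, 0.8 * N)

train <- df[trainIndex, ]   # 80% of rows for training
test  <- df[-trainIndex, ]  # remaining 20% for testing

nrow(train)  # 8
nrow(test)   # 2
```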
Code
trainIndexDF<-as.data.frame(t(trainIndex))
dim(trainIndexDF)
[1]   1 416
Code
#dim(Math_DF_Test)

#Create training vector

trainIndexDF
Code
trainIndexDF<-trainIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#trainIndexDF<-mutate(trainIndexDF, doc_id = str_trim(paste("text", ID, #sep="")))%>%
#  select("doc_id")

#trainIndexDF

STE_DF_Train_Full
Code
STE_DF_Train<-left_join(trainIndexDF, STE_DF_Train_Full, by = "id")

STE_DF_Train
Code
STE_DF_Train_X<-STE_DF_Train%>%
  select(`standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `grade_level`, model_factor, algebra_factor)

STE_DF_Train_Base_X<-STE_DF_Train%>%
  select(`standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `grade_level`)


STE_DF_Train_Y<-STE_DF_Train$school_state_diff

Create Test Set

Code
testIndexDF<-as.data.frame(t(testIndex))
#dim(testIndexDF)

#Create testing vector

#testIndexDF

testIndexDF<-testIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#testIndexDF<-mutate(testIndexDF, doc_id = str_trim(paste("text", ID, sep="")))%>%
#  select("doc_id")

#STE_DF_Test_Full

STE_DF_Test<-left_join(testIndexDF, STE_DF_Train_Full, by = "id")

STE_DF_Test_X<-STE_DF_Test%>%
select(`standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `grade_level`, model_factor, algebra_factor)

STE_DF_Test_Base_X<-STE_DF_Test%>%
select(`standard`, `reporting category`, `discipline core idea`, `pts`, `year`, `type`, `state%`, `grade_level`)


STE_DF_Test_Y<-STE_DF_Test%>%
  select(`school_state_diff`)


STE_DF_Test_Y
Code
STE_DF_Train_Y
  [1]   7  -5   3  12  -1  -2   2  14   1  -2  -2  -3   6   9  -4  10  -1   6
 [19] -12  15  18  12 -10   9   7  13  17  11   5  15   5 -11   6  -4   7  13
 [37]  13  -2   6   5   9   6   9   2  12  19  19  -7  15  18   1   3  -3 -10
 [55]  10  -2  -8   4  13   9  12  13   1  12 -18   8  11   5   4  -2   0  18
 [73]   3 -15 -15  10  -2  -6  10   8  -2  -5   5  11   4   1  -3   5   4   9
 [91]   1  11  -3   4   0  -2  -5   8   7   6   4  10   1  -5  -5 -20  19  -2
[109]  -3  21  19   9   3   3   1  -9  -8  -3   1   6   6   5  11   1   1   0
[127]   6   6   3   3   7   6  -4   8 -17   5   6  11  11  -1  -1 -12   0  -3
[145]  -2  28   1  -1  -8  23   5  13  -6   0  16  11   2   8  -3  -4  12 -10
[163]   4   9   8   9   1  -5   0   8   2  10  -4   7  -1  -4   5   0   0   4
[181]   1   4  11 -13   1   5   5   5   5   5  -7  19  -1   2   5  10  15   9
[199]   7  28   7   9   4  -1   7   8  13   7 -17   4   3   9  -2  -5  11  11
[217]  10  13 -14  -1   8  -3   2   2 -18   8   1   5   7  10   6   5  14  11
[235]  14   1   1 -11   7   5   0  -5  12  -1  -3   9  11   7  23  -4  13   1
[253]  -5  12   3  -4   4 -13   6  15  -3   4   4  -4   0  10  21   3   0 -12
[271]   1   6   4  19  11   2   8   4  -2  12   8  -2  -2   8   1  -1  12 -13
[289]   5 -15  -4  -3   5  -4  26  -5  13   1  -6   2   2   1  -2  -3  -3  14
[307]  -7   8  16  -4  15   0  16  10   0  22  10  -9   4   4  -2   8   0   9
[325]  11  -1   7   8   4   6  -4   0   7 -17   4  17  -7   3  13   7   8 -10
[343]   0  12  14   0  12  23  -6   2   1   7   3   5   6   0  -2   8   1   8
[361]   6   5  22  -9   8  12 -10   9   3   6  -2  18  -7   6   5  -1 -20   8
[379] -10   7  11   9   4   6   7  10   0   4  10  -8   0  18  10  13  14  -1
[397]   3   4   5   8   9 -20   3  10  14   2  -1   8  20   0   9   2 -15   7
[415]   0  -4
Code
STE_DF_Train_X
Code
STE_DF_Test_X
Code
STE_DF_Test_Base_X

Cross Validation

Tune: Random Forest w/ text features

Code
data_ctrl<-trainControl(method = "cv", number = 5)

STE_DF_Train_Full_Text<-STE_DF_Train_Full%>%
  select(`standard`, `reporting category`, `type`, `year`, `discipline core idea`, `pts`, `model_factor`, `algebra_factor`, `state%`, `grade_level`, `school_state_diff` )

STE_DF_Train_Full_Base<-STE_DF_Train_Full%>%
  select(`standard`, `reporting category`, `type`, `year`, `discipline core idea`, `pts`, `state%`, `grade_level`, `school_state_diff` )
Code
ste_model_caret_rf_text<-train(school_state_diff ~.
                      , data = STE_DF_Train_Full_Text,
                      trControl = data_ctrl,
                      method = "rf")
Code
ste_model_caret_rf_text
Random Forest 

521 samples
 10 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 415, 417, 417, 416, 419 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared   MAE     
    2   7.365803  0.2731094  5.661247
   56   5.492316  0.5578947  3.956106
  111   5.471555  0.5567611  3.926335

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 111.
Code
ste_model_caret_rf_text$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 111

          Mean of squared residuals: 20.86243
                    % Var explained: 68.71
Code
ste_model_caret_rf<-train(school_state_diff ~.
                      , data = STE_DF_Train_Full_Base,
                      trControl = data_ctrl,
                      method = "rf")
Code
ste_model_caret_rf$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 55

          Mean of squared residuals: 21.65668
                    % Var explained: 67.52

Random Forest Model

Regression Base Model

Code
set.seed(444)

STE_diff_RF<-randomForest(STE_DF_Train_Base_X,
                      y = STE_DF_Train$school_state_diff,
                      xtest = STE_DF_Test_Base_X,
                      ytest = STE_DF_Test$school_state_diff,
                      mtry = 3,
                      importance = TRUE,
                      type="regression",
                      ntree = 500)


STE_diff_RF

Call:
 randomForest(x = STE_DF_Train_Base_X, y = STE_DF_Train$school_state_diff,      xtest = STE_DF_Test_Base_X, ytest = STE_DF_Test$school_state_diff,      ntree = 500, mtry = 3, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 3

          Mean of squared residuals: 30.42441
                    % Var explained: 55.14
                       Test set MSE: 44.26
                    % Var explained: 29.05
Code
varImpPlot(STE_diff_RF)

Random Forest Model w/ text features

Regression

Code
set.seed(444)

STE_diff_RF_Text<-randomForest(STE_DF_Train_X,
                      y = STE_DF_Train$school_state_diff,
                      xtest = STE_DF_Test_X,
                      ytest = STE_DF_Test$school_state_diff,
                      mtry = 4,
                      importance = TRUE,
                      type="regression",
                      ntree = 500)


STE_diff_RF_Text

Call:
 randomForest(x = STE_DF_Train_X, y = STE_DF_Train$school_state_diff,      xtest = STE_DF_Test_X, ytest = STE_DF_Test$school_state_diff,      ntree = 500, mtry = 4, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 4

          Mean of squared residuals: 29.32
                    % Var explained: 56.77
                       Test set MSE: 42.17
                    % Var explained: 32.39
Code
varImpPlot(STE_diff_RF_Text)
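
Comparing the two printed summaries directly: the model with the dictionary-based text features has a lower test-set MSE than the base model. A quick base R check, with the MSE values copied from the output above:

```r
mse_base <- 44.26  # test set MSE, base feature model
mse_text <- 42.17  # test set MSE, model with text features

improvement <- (mse_base - mse_text) / mse_base * 100
round(improvement, 1)  # 4.7 (percent reduction in test MSE)
```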

English Language Arts Exam

Create Corpus

Code
ELA_item_corpus <-corpus(ELA_DF, text_field = "item description")

#print(ELA_item)

summary(ELA_item_corpus)
Code
#ELA_DF

preText Pre-Processing Decisions

Before completing pre-processing, I examined the different choices using preText. I was surprised to see that removing stop words had a positive coefficient, yet the combination “P-W” had the lowest score of the preText results. I can see how key logical words like “not”, “and”, and “or”, which are also stop words, can have a significant impact on the meaning of an exam question. Perhaps, because each individual text is so short, and the texts are designed to assess content skills rather than to tell a narrative, the stop words play a more significant role?

Given these results, I will pre-process the data two ways: once using the specification with the lowest recommended score, and once including lower-casing and lemmatization to use for visualizations.

  • Recommended preText score: “S-W-3” (remove stop words, lemmatization/stemming)

  • Alternative approach: “P-N-W-L” plus lemmatization (remove punctuation, remove numbers, remove stop words, lower case, and lemmatization)

Also, based on teacher input, I am creating a subcorpus of items that assessed Literature and a subcorpus of items that assessed Informational text. Teachers describe having to change their approach when teaching these text types, and students show different weaknesses according to text type.
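
A minimal sketch of that split, assuming the item data carry a column flagging each item's text type; the column name text_type and its values here are hypothetical stand-ins for whatever the released item file actually records:

```r
# toy stand-in for ELA_DF; the text_type column is a hypothetical docvar
ela_toy <- data.frame(
  `item description` = c("Determine the effect of figurative language in a passage.",
                         "Identify the author's central claim in an article."),
  text_type = c("Literature", "Informational"),
  check.names = FALSE
)

lit_items  <- subset(ela_toy, text_type == "Literature")
info_items <- subset(ela_toy, text_type == "Informational")

# each subset could then be passed to corpus(..., text_field = "item description")
nrow(lit_items)   # 1
nrow(info_items)  # 1
```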

Code
# Sys.unsetenv("GITHUB_PAT")
# devtools::install_github("matthewjdenny/preText")
# library(preText)
Code
# preprocessed_documents_ELA <- factorial_preprocessing(
#     ELA_item_corpus,
#     use_ngrams = TRUE,
#     infrequent_term_threshold = 0.2,
#     verbose = FALSE)
Code
# names(preprocessed_documents_ELA)
# 
# head(preprocessed_documents_ELA$choices)
Code
# preText_results <- preText(
#     preprocessed_documents_ELA,
#     dataset_name = "ELA MCAS Item Descriptions",
#     distance_method = "cosine",
#     num_comparisons = 20,
#     verbose = FALSE)
# 
# 
# preText_score_plot(preText_results)
Code
# regression_coefficient_plot(preText_results,
#                             remove_intercept = TRUE)

Tokenization 1: L-W-Lemma

Code
## Extract the tokens

ELA_item_tokens <- tokens(ELA_item_corpus)

print(ELA_item_tokens)
Tokens consisting of 693 documents and 13 docvars.
text1 :
 [1] "Determine"  "the"        "effect"     "of"         "the"       
 [6] "repetition" "of"         "a"          "word"       "using"     
[11] "evidence"   "from"      
[ ... and 3 more ]

text2 :
 [1] "Determine" "what"      "event"     "will"      "happen"    "next"     
 [7] "using"     "evidence"  "from"      "the"       "passage"   "."        

text3 :
 [1] "Determine"  "the"        "effect"     "of"         "figurative"
 [6] "language"   "in"         "a"          "passage"    "."         

text4 :
 [1] "Identify"   "the"        "purpose"    "of"         "an"        
 [6] "author's"   "use"        "of"         "literary"   "techniques"
[11] "in"         "a"         
[ ... and 2 more ]

text5 :
 [1] "Make"        "an"          "inference"   "about"       "how"        
 [6] "mood"        "is"          "established" "in"          "a"          
[11] "passage"     "."          

text6 :
 [1] "Determine" "the"       "effect"    "of"        "the"       "structure"
 [7] "of"        "a"         "section"   "in"        "a"         "passage"  
[ ... and 1 more ]

[ reached max_ndoc ... 687 more documents ]
Code
ELA_item_tokens1 <- tokens_tolower(ELA_item_tokens)

ELA_item_tokens1 <- tokens_select(ELA_item_tokens1,
                   pattern = stopwords("en"),
                  selection = "remove")

print(ELA_item_tokens1)
Tokens consisting of 693 documents and 13 docvars.
text1 :
[1] "determine"  "effect"     "repetition" "word"       "using"     
[6] "evidence"   "passage"    "."         

text2 :
[1] "determine" "event"     "happen"    "next"      "using"     "evidence" 
[7] "passage"   "."        

text3 :
[1] "determine"  "effect"     "figurative" "language"   "passage"   
[6] "."         

text4 :
[1] "identify"   "purpose"    "author's"   "use"        "literary"  
[6] "techniques" "passage"    "."         

text5 :
[1] "make"        "inference"   "mood"        "established" "passage"    
[6] "."          

text6 :
[1] "determine" "effect"    "structure" "section"   "passage"   "."        

[ reached max_ndoc ... 687 more documents ]

Tokenization 2: P-N-L-W & Lemmatization

Code
# remove punctuation and numbers
ELA_item_tokens2 <- tokens(ELA_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

# remove stopwords

ELA_item_tokens2 <- tokens_select(ELA_item_tokens2,
                   pattern = stopwords("en"),
                  selection = "remove")

# lower case

ELA_item_tokens2 <- tokens_tolower(ELA_item_tokens2)




print(ELA_item_tokens2)
Tokens consisting of 693 documents and 13 docvars.
text1 :
[1] "determine"  "effect"     "repetition" "word"       "using"     
[6] "evidence"   "passage"   

text2 :
[1] "determine" "event"     "happen"    "next"      "using"     "evidence" 
[7] "passage"  

text3 :
[1] "determine"  "effect"     "figurative" "language"   "passage"   

text4 :
[1] "identify"   "purpose"    "author's"   "use"        "literary"  
[6] "techniques" "passage"   

text5 :
[1] "make"        "inference"   "mood"        "established" "passage"    

text6 :
[1] "determine" "effect"    "structure" "section"   "passage"  

[ reached max_ndoc ... 687 more documents ]

Lemmatization

Code
lem_ELA_item_tokens2<-tokens_replace(ELA_item_tokens2,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

lem_ELA_item_tokens2
Tokens consisting of 693 documents and 13 docvars.
text1 :
[1] "determine"  "effect"     "repetition" "word"       "use"       
[6] "evidence"   "passage"   

text2 :
[1] "determine" "event"     "happen"    "next"      "use"       "evidence" 
[7] "passage"  

text3 :
[1] "determine"  "effect"     "figurative" "language"   "passage"   

text4 :
[1] "identify"  "purpose"   "author's"  "use"       "literary"  "technique"
[7] "passage"  

text5 :
[1] "make"      "inference" "mood"      "establish" "passage"  

text6 :
[1] "determine" "effect"    "structure" "section"   "passage"  

[ reached max_ndoc ... 687 more documents ]
Code
lem_ELA_item_tokens1<-tokens_replace(ELA_item_tokens1,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

lem_ELA_item_tokens1
Tokens consisting of 693 documents and 13 docvars.
text1 :
[1] "determine"  "effect"     "repetition" "word"       "use"       
[6] "evidence"   "passage"    "."         

text2 :
[1] "determine" "event"     "happen"    "next"      "use"       "evidence" 
[7] "passage"   "."        

text3 :
[1] "determine"  "effect"     "figurative" "language"   "passage"   
[6] "."         

text4 :
[1] "identify"  "purpose"   "author's"  "use"       "literary"  "technique"
[7] "passage"   "."        

text5 :
[1] "make"      "inference" "mood"      "establish" "passage"   "."        

text6 :
[1] "determine" "effect"    "structure" "section"   "passage"   "."        

[ reached max_ndoc ... 687 more documents ]

Create DFMs

Code
df_ELA_toks2<-dfm(lem_ELA_item_tokens2)

df_ELA_toks1<-dfm(lem_ELA_item_tokens1)

n-grams

When I examine the phrases in a sample of six weak-item documents, the weakest documents seem to consistently involve analyzing how a particular element of a text (a word, paragraph, or section) functions within a passage. I would like to capture this as a text feature.

Code
library(phrasemachine) 
documents_weak_items<-as.character(corpus_subset(ELA_item_corpus, school_state_diff < -2))[10:15]

#documents_weak_items

phrases<-phrasemachine(documents_weak_items, minimum_ngram_length = 2,
                         maximum_ngram_length = 4,
                         return_phrase_vectors = TRUE,
                         return_tag_sequences = TRUE)
Currently tagging document 1 of 6 
Currently tagging document 2 of 6 
Currently tagging document 3 of 6 
Currently tagging document 4 of 6 
Currently tagging document 5 of 6 
Currently tagging document 6 of 6 
Extracting phrases from document 1 of 6 
Extracting phrases from document 2 of 6 
Extracting phrases from document 3 of 6 
Extracting phrases from document 4 of 6 
Extracting phrases from document 5 of 6 
Extracting phrases from document 6 of 6 
Code
# look at some example phrases
print(phrases[[1]]$phrases[1:20])
 [1] "function_of_a_word" "word_in_a_passage"  NA                  
 [4] NA                   NA                   NA                  
 [7] NA                   NA                   NA                  
[10] NA                   NA                   NA                  
[13] NA                   NA                   NA                  
[16] NA                   NA                   NA                  
[19] NA                   NA                  
Code
documents_weak_items
                                                                                     text171 
                                              "Analyze the function of a word in a passage." 
                                                                                     text181 
"Determine how a particular paragraph contributes to the development of ideas in a passage." 
                                                                                     text191 
                                      "Identify context to support the meaning of a phrase." 
                                                                                     text217 
                          "Analyze how an author uses point of view to develop a character." 
                                                                                     text231 
    "Explain how particular sections of a passage contribute to the reader's understanding." 
                                                                                     text232 
                    "Explain how details in a poem reveal the point of view of the speaker." 

Worst performing Items

Literature

Code
ELA_worst_lit_item_corpus<- corpus_subset(ELA_item_corpus, school_state_diff < 2 & text_type == "Literature" )

# remove punctuation and numbers
ELA_worst_lit_tokens <- tokens(ELA_worst_lit_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

ELA_worst_lit_tokens <- tokens_tolower(ELA_worst_lit_tokens)


ELA_worst_lit_tokens <-  tokens_select(ELA_worst_lit_tokens,
                   pattern = stopwords("en"),
                  selection = "remove")

lem_ELA_worst_lit_tokens<-tokens_replace(ELA_worst_lit_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_ELA_worst_lit_tokens<-dfm(lem_ELA_worst_lit_tokens)
df_ELA_worst_lit_toks_smaller<- dfm_trim(df_ELA_worst_lit_tokens, min_docfreq = 0.09, docfreq_type = "prop")

Informational

Code
ELA_worst_inf_item_corpus<- corpus_subset(ELA_item_corpus, school_state_diff < 2 & text_type == "Informational" )

# remove punctuation and numbers
ELA_worst_inf_tokens <- tokens(ELA_worst_inf_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

ELA_worst_inf_tokens <- tokens_tolower(ELA_worst_inf_tokens)


ELA_worst_inf_tokens <-  tokens_select(ELA_worst_inf_tokens,
                   pattern = stopwords("en"),
                  selection = "remove")

lem_ELA_worst_inf_tokens<-tokens_replace(ELA_worst_inf_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_ELA_worst_inf_tokens<-dfm(lem_ELA_worst_inf_tokens)
df_ELA_worst_inf_toks_smaller<- dfm_trim(df_ELA_worst_inf_tokens, min_docfreq = 0.09, docfreq_type = "prop")

Best performing Items

Literature

Code
ELA_best_lit_item_corpus<- corpus_subset(ELA_item_corpus, school_state_diff > 5 & text_type == "Literature" )

# remove punctuation and numbers
ELA_best_lit_tokens <- tokens(ELA_best_lit_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

ELA_best_lit_tokens <- tokens_tolower(ELA_best_lit_tokens)


ELA_best_lit_tokens <-  tokens_select(ELA_best_lit_tokens,
                   pattern = stopwords("en"),
                  selection = "remove")

lem_ELA_best_lit_tokens<-tokens_replace(ELA_best_lit_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_ELA_best_lit_tokens<-dfm(lem_ELA_best_lit_tokens)
df_ELA_best_lit_toks_smaller<- dfm_trim(df_ELA_best_lit_tokens, min_docfreq = 0.09, docfreq_type = "prop")

Informational

Code
ELA_best_inf_item_corpus<- corpus_subset(ELA_item_corpus, school_state_diff > 5 & text_type == "Informational" )

# remove punctuation and numbers
ELA_best_inf_tokens <- tokens(ELA_best_inf_item_corpus, 
    remove_punct = T,
    remove_numbers = T)

ELA_best_inf_tokens <- tokens_tolower(ELA_best_inf_tokens)


ELA_best_inf_tokens <-  tokens_select(ELA_best_inf_tokens,
                   pattern = stopwords("en"),
                  selection = "remove")

lem_ELA_best_inf_tokens<-tokens_replace(ELA_best_inf_tokens,
                             pattern = lexicon::hash_lemmas$token,
                             replacement = lexicon::hash_lemmas$lemma)

df_ELA_best_inf_tokens<-dfm(lem_ELA_best_inf_tokens)
df_ELA_best_inf_toks_smaller<- dfm_trim(df_ELA_best_inf_tokens, min_docfreq = 0.09, docfreq_type = "prop")

Descriptive Analysis

When I examine the top features, it is notable that the words evidence and support appear as top features in informational text but not in literature. The word poem ranks much higher and appears more frequently as a top feature in the worst-performing literature items.

All Items: Top Features

Code
topfeatures(df_ELA_toks2, 20)
  passage determine  identify   analyze character      idea   article   support 
      388       292       171       142       133       111       105        96 
     mean       two paragraph       use  specific      poem      word    detail 
       92        88        87        85        85        83        80        78 
  excerpt  evidence inference      make 
       77        74        72        68 
Code
#topfeatures(df_Math_toks1, 20)

Lit: Worst performing Items: Top Features

Code
topfeatures(df_ELA_worst_lit_toks_smaller, 20)
  passage determine      poem character   analyze  identify   excerpt    detail 
       45        38        23        22        21        19        18        16 
      two  specific       use paragraph      make inference      base      idea 
       14        13        12        12        12        12        11        10 
    theme   develop 
        9         9 

Lit Best performing Items: Top Features

Code
topfeatures(df_ELA_best_lit_toks_smaller, 20)
  passage determine character   analyze  identify    detail inference      make 
      134        83        72        45        41        32        27        26 
      use      poem  specific      mean      base paragraph      word   support 
       25        25        25        24        23        21        20        18 
      two 
       18 

Inf: Worst performing Items: Top Features

Code
topfeatures(df_ELA_worst_inf_toks_smaller, 20)
  article determine  identify   passage      idea   support  evidence paragraph 
       32        29        25        25        19        14        12        11 
      use   purpose  sentence   section      mean  specific   context   analyze 
       10        10         9         9         9         9         8         8 
      two     claim      word   similar 
        7         7         7         7 

Inf Best performing Items: Top Features

Code
topfeatures(df_ELA_best_inf_toks_smaller, 20)
  passage determine  identify   article      idea   support       two      mean 
       72        67        41        41        36        27        25        24 
  purpose   excerpt      word paragraph   similar   analyze     topic  evidence 
       22        22        21        21        21        21        20        19 
  context       use 
       16        14 
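These impressions from eyeballing the topfeatures() output can be quantified by contrasting relative feature frequencies between the worst and best DFMs. A sketch on toy data (the two toy DFMs are illustrative stand-ins for df_ELA_worst_lit_toks_smaller and df_ELA_best_lit_toks_smaller):

```r
library(quanteda)

# Toy stand-ins for the worst/best literature DFMs
worst_dfm <- dfm(tokens(c("determine theme of poem", "analyze poem stanza")))
best_dfm  <- dfm(tokens(c("determine character detail", "analyze character inference")))

# Relative frequency of each feature within each group
prop_worst <- colSums(worst_dfm) / sum(worst_dfm)
prop_best  <- colSums(best_dfm)  / sum(best_dfm)

# Positive values: over-represented among worst items; negative: among best
all_feats <- union(names(prop_worst), names(prop_best))
freq_diff <- setNames(numeric(length(all_feats)), all_feats)
freq_diff[names(prop_worst)] <- freq_diff[names(prop_worst)] + prop_worst
freq_diff[names(prop_best)]  <- freq_diff[names(prop_best)]  - prop_best
sort(freq_diff, decreasing = TRUE)
```

A more formal alternative is quanteda.textstats::textstat_keyness(), which adds a significance test to the same worst-vs-best contrast.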

Visualizations

Code
#ELA_DF

ELA_DF%>%
  group_by(`reporting category`)%>%
  summarize(mean_diff = mean(school_state_diff))
Code
ELA_DF%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`text_type`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle(" ELA MCAS: Student Performance by Reporting Category") 

Code
ELA_DF%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`Cluster`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("ELA MCAS: Student Performance by Reporting Category")

Code
ELA_DF%>%
  filter(text_type == "Literature")%>%
  ggplot( aes(x=`reporting category`, y=`school_state_diff`, fill=`Cluster`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11)
    ) +
    ggtitle("ELA MCAS: Student Performance by Reporting Category") +
    labs(caption = "Literature") 

Code
ELA_DF%>%
  filter(`reporting category` == "LA")%>%
  ggplot( aes(x=`Cluster`, y=`school_state_diff`, fill=`text_type`)) +
    geom_boxplot() +
    scale_fill_viridis(discrete = TRUE, alpha=0.6) +
    geom_jitter(color="black", size=0.4, alpha=0.9) +
    theme_ipsum() +
    theme(
      #legend.position="none",
      plot.title = element_text(size=11),
      axis.text.x = element_text(angle = 45, hjust = 1)
    ) +
    ggtitle("ELA MCAS: Student Performance by Language Cluster and Text Type") 

Word Cloud

Code
#dfm_item
textplot_wordcloud(df_ELA_toks2)

Code
smaller_dfm <-dfm_trim(df_ELA_toks2, min_termfreq = 5)
smaller_dfm <- dfm_trim(smaller_dfm, min_docfreq = 0.05, docfreq_type = "prop")

textplot_wordcloud(smaller_dfm, min_count =3, random_order = FALSE)

Word Connections

Literature

Best Items
Code
### Word connections for best items

dim(df_ELA_best_lit_toks_smaller)
[1] 196  17
Code
# compute size weight for vertices in network
size <- log(colSums(df_ELA_best_lit_toks_smaller))

# create plot
textplot_network(df_ELA_best_lit_toks_smaller, vertex_size = size/ max(size) * 3)

Worst Items
Code
### Word connections for worst items

dim(df_ELA_worst_lit_toks_smaller)
[1] 93 18
Code
# compute size weight for vertices in network
size <- log(colSums(df_ELA_worst_lit_toks_smaller))

# create plot
textplot_network(df_ELA_worst_lit_toks_smaller, vertex_size = size/ max(size) * 3)

Word Connections

Informational

Best Items
Code
### Word connections for best items

dim(df_ELA_best_inf_toks_smaller)
[1] 146  18
Code
# compute size weight for vertices in network
size <- log(colSums(df_ELA_best_inf_toks_smaller))

# create plot
textplot_network(df_ELA_best_inf_toks_smaller, vertex_size = size/ max(size) * 3)

Worst Items
Code
### Word connections for worst items

dim(df_ELA_worst_inf_toks_smaller)
[1] 63 22
Code
# compute size weight for vertices in network
size <- log(colSums(df_ELA_worst_inf_toks_smaller))

# create plot
textplot_network(df_ELA_worst_inf_toks_smaller, vertex_size = size/ max(size) * 3)

Dictionary to Create New Features

Dictionary: Key ELA Terms:

Code
my_ela_dict <- dictionary(list(poetry=c("poem","stanza", "lyric", "verse"),
                               evidence = c("claim", "support", "evidence", "supports", "purpose"),
                               # note: the multiword entry "figurative language" cannot match a
                               # unigram dfm via dfm_lookup(); the tokens would need to be
                               # compounded (e.g., with tokens_compound()) before the lookup
                               fig_language = c("figurative language", "simile", "metaphor")))

my_ela_dict
Dictionary object with 3 key entries.
- [poetry]:
  - poem, stanza, lyric, verse
- [evidence]:
  - claim, support, evidence, supports, purpose
- [fig_language]:
  - figurative language, simile, metaphor
Code
ELA_Toks1_mydict <- df_ELA_toks1 %>%
  dfm_lookup(my_ela_dict)

tail(ELA_Toks1_mydict, 10)
Document-feature matrix of: 10 documents, 3 features (90.00% sparse) and 13 docvars.
         features
docs      poetry evidence fig_language
  text684      0        1            0
  text685      0        0            0
  text686      0        0            0
  text687      0        0            0
  text688      0        0            0
  text689      0        0            0
[ reached max_ndoc ... 4 more documents ]
Code
df_ELADict <- convert(ELA_Toks1_mydict, to = "data.frame")



df_ELADict
Code
#Math_DF<-left_join(df_MathDict, Math_DF, by = "doc_id")
Code
ELA_DF<-mutate(ELA_DF, ID = as.character(row_number()))

ELA_DF<-mutate(ELA_DF, doc_id = str_trim(paste("text", ID, sep="")))

ELA_DF<-left_join(df_ELADict, ELA_DF, by = "doc_id")

ELA_DF
Code
ELA_DF<-ELA_DF%>%
  mutate(poetry_factor = case_when(
    poetry == 0 ~ "no",
    poetry > 0 ~ "yes"
  ))%>%
    mutate(fig_language_factor = case_when(
    fig_language == 0 ~ "no",
    fig_language > 0 ~ "yes"
  ))%>%
    
    mutate(evidence_factor = case_when(
    evidence == 0 ~ "no",
    evidence > 0 ~ "yes"
  ))

Machine Learning

Code
ELA_DF
Code
ELA_DF_Train_Full<-ELA_DF%>%
   mutate(year = as.factor(year))%>%
#   mutate(`school_state_bool` = case_when(
#     `school_state_diff` <= 0 ~ "negative",
#     `school_state_diff` > 0 ~ "positive")
#   )%>%
  filter(grade_level == 10| grade_level ==8)%>%
  # filter(grade_level == 10 | grade_level == 8| grade_level == 7| grade_level == 6|grade_level == 5)%>%
  mutate(id = row_number())%>%
   select(`id`, `standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `school_state_diff`, `grade_level`, `poetry_factor`, fig_language_factor, evidence_factor)
# 
 ELA_DF_Train_Full

Create Training Set

Code
# set seed
set.seed(12345)

N<-nrow(ELA_DF_Train_Full)

trainIndex <- sample(1:N, .8*N)

testIndex<-c(1:N)[-trainIndex]

# check length of training set
#length(trainIndex)

trainIndexDF<-trainIndex

#Math_DF_Test<-Math_DF_Train_Full[-Math_DF_Train]


trainIndexDF<-as.data.frame(t(trainIndex))
dim(trainIndexDF)
[1]   1 222
Code
#dim(Math_DF_Test)

#Create training vector

#trainIndexDF

trainIndexDF<-trainIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#trainIndexDF<-mutate(trainIndexDF, doc_id = str_trim(paste("text", ID, #sep="")))%>%
#  select("doc_id")

#trainIndexDF

#ELA_DF_Train_Full

ELA_DF_Train<-left_join(trainIndexDF, ELA_DF_Train_Full, by = "id")

ELA_DF_Train
Code
ELA_DF_Train_X<-ELA_DF_Train%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`,  `grade_level`, `poetry_factor`, fig_language_factor, evidence_factor)

ELA_DF_Train_Base_X<-ELA_DF_Train%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`,  `grade_level`)


ELA_DF_Train_Y<-ELA_DF_Train$school_state_diff

Create Test Set

Code
testIndexDF<-testIndex

#ELA_DF_Test<-ELA_DF_Test_Full[-ELA_DF_Test]


testIndexDF<-as.data.frame(t(testIndex))
#dim(testIndexDF)
#dim(ELA_DF_Test)

#Create testing vector

#testIndexDF

testIndexDF<-testIndexDF%>%
  pivot_longer(cols = starts_with("v"), names_to = "value_id", values_to = "id")
#testIndexDF<-mutate(testIndexDF, doc_id = str_trim(paste("text", ID, sep="")))%>%
#  select("doc_id")

#ELA_DF_Test_Full

ELA_DF_Test<-left_join(testIndexDF, ELA_DF_Train_Full, by = "id")

ELA_DF_Test_X<-ELA_DF_Test%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`,  `grade_level`, `poetry_factor`, fig_language_factor, evidence_factor)

ELA_DF_Test_Base_X<-ELA_DF_Test%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`,  `grade_level`)


ELA_DF_Test_Y<-ELA_DF_Test$school_state_diff

Cross Validation

Tune: Random Forest w/ text features

Code
data_ctrl<-trainControl(method = "cv", number = 5)

ELA_DF_Train_Full_Text<-ELA_DF_Train_Full%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `school_state_diff`, `grade_level`, `poetry_factor`, fig_language_factor, evidence_factor)

ELA_DF_Train_Full_Base<-ELA_DF_Train_Full%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `school_state_diff`, `grade_level`)
Code
ela_model_caret_rf_text<-train(school_state_diff ~.
                      , data = ELA_DF_Train_Full_Text,
                      trControl = data_ctrl,
                      method = "rf")
Code
ela_model_caret_rf_text
Random Forest 

278 samples
 10 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 222, 224, 222, 222, 222 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE     
   2    5.537925  0.05685022  4.316009
  17    5.712426  0.03901883  4.404182
  32    5.776146  0.04145591  4.452287

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
Code
ela_model_caret_rf_text$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 2

          Mean of squared residuals: 31.4449
                    % Var explained: 2.42

Tune: Base Random Forest

Code
ela_model_caret_rf<-train(school_state_diff ~.
                      , data = ELA_DF_Train_Full_Base,
                      trControl = data_ctrl,
                      method = "rf")
Code
ela_model_caret_rf
Random Forest 

278 samples
  7 predictor

No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 224, 222, 222, 221, 223 
Resampling results across tuning parameters:

  mtry  RMSE      Rsquared    MAE     
   2    5.481401  0.09159678  4.262053
  15    5.621994  0.06526151  4.346481
  29    5.755399  0.05006389  4.434729

RMSE was used to select the optimal model using the smallest value.
The final value used for the model was mtry = 2.
Code
ela_model_caret_rf$finalModel

Call:
 randomForest(x = x, y = y, mtry = param$mtry) 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 2

          Mean of squared residuals: 31.02473
                    % Var explained: 3.72
Code
set.seed(444)

ELA_diff_RF<-randomForest(ELA_DF_Train_Base_X,
                      y = ELA_DF_Train$school_state_diff,
                      xtest = ELA_DF_Test_Base_X,
                      ytest = ELA_DF_Test$school_state_diff,
                      mtry = 2,
                      importance = TRUE,
                      type="regression",
                      ntree = 500)
Code
ELA_diff_RF

Call:
 randomForest(x = ELA_DF_Train_Base_X, y = ELA_DF_Train$school_state_diff,      xtest = ELA_DF_Test_Base_X, ytest = ELA_DF_Test$school_state_diff,      ntree = 500, mtry = 2, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 2

          Mean of squared residuals: 31.74277
                    % Var explained: -2.01
                       Test set MSE: 43.24
                    % Var explained: -17.37
Code
#predicted<-predict(ELA_diff_RF, ELA_DF_Test_X)
#mae(predicted, ELA_DF_Test$school_state_diff)

#actual <- ELA_DF_Test$school_state_diff




#R2 <- 1 - (sum((actual-predicted)^2)/sum((actual-mean(actual))^2))

#R2

# mean(abs(actual-predicted))
# 
# mean(abs(actual))
# sd(abs(actual))
# sd(abs(actual-predicted))
Code
varImpPlot(ELA_diff_RF)

RF Regression w/text

Code
set.seed(444)

ELA_diff_RF_text<-randomForest(x = ELA_DF_Train_X,
                      y = ELA_DF_Train$school_state_diff,
                      xtest = ELA_DF_Test_X,
                      ytest = ELA_DF_Test$school_state_diff,
                      mtry = 2,
                      importance = TRUE,
                      type="regression",
                      ntree = 500)
Code
ELA_diff_RF_text

Call:
 randomForest(x = ELA_DF_Train_X, y = ELA_DF_Train$school_state_diff,      xtest = ELA_DF_Test_X, ytest = ELA_DF_Test$school_state_diff,      ntree = 500, mtry = 2, importance = TRUE, type = "regression") 
               Type of random forest: regression
                     Number of trees: 500
No. of variables tried at each split: 2

          Mean of squared residuals: 31.29259
                    % Var explained: -0.56
                       Test set MSE: 41.04
                    % Var explained: -11.41
Code
#predicted<-predict(ELA_diff_RF, ELA_DF_Test_X)
#mae(predicted, ELA_DF_Test$school_state_diff)

#actual <- ELA_DF_Test$school_state_diff




#R2 <- 1 - (sum((actual-predicted)^2)/sum((actual-mean(actual))^2))

#R2

# mean(abs(actual-predicted))
# 
# mean(abs(actual))
# sd(abs(actual))
# sd(abs(actual-predicted))
Code
varImpPlot(ELA_diff_RF_text)

Support Vector Regression

Base SVM Model

Code
ELA_SVM_Train<-ELA_DF_Train%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `grade_level`, school_state_diff)
#Math_DF_Train_X
#Math_DF_Train$school_state_diff

svm_ela_base<-svm(school_state_diff ~., data = ELA_SVM_Train, scale = FALSE, kernel = "linear")
                     
svm_ela_base

Call:
svm(formula = school_state_diff ~ ., data = ELA_SVM_Train, kernel = "linear", 
    scale = FALSE)


Parameters:
   SVM-Type:  eps-regression 
 SVM-Kernel:  linear 
       cost:  1 
      gamma:  0.03333333 
    epsilon:  0.1 


Number of Support Vectors:  217
Code
svm_ela_pred<-predict(svm_ela_base, ELA_DF_Test_X)

 mae(svm_ela_pred, ELA_DF_Test$school_state_diff)
[1] 4.650072
Code
 mean(abs(ELA_DF_Test$school_state_diff))
[1] 6.75
Code
  sd(abs(ELA_DF_Test$school_state_diff))
[1] 4.366192
Code
# 
Code
svmCoefs <- as.data.frame(t(coefficients(svm_ela_base)))

svmCoefs

SVM w/text feature Model

Code
ELA_SVM_Text_Train<-ELA_DF_Train%>%
  select(`standard`, `reporting category`, `Cluster`, `text_type`, `year`, `state%`, `grade_level`, school_state_diff, poetry_factor)
#Math_DF_Train_X
#Math_DF_Train$school_state_diff

svm_ela_text<-svm(school_state_diff ~., data = ELA_SVM_Text_Train, scale = FALSE, kernel = "linear")
                     
svm_ela_text

Call:
svm(formula = school_state_diff ~ ., data = ELA_SVM_Text_Train, kernel = "linear", 
    scale = FALSE)


Parameters:
   SVM-Type:  eps-regression 
 SVM-Kernel:  linear 
       cost:  1 
      gamma:  0.03225806 
    epsilon:  0.1 


Number of Support Vectors:  219
Code
ELA_DF_Test
Code
svm_ela_pred<-predict(svm_ela_text, ELA_DF_Test_X)

 mae(svm_ela_pred, ELA_DF_Test$school_state_diff)
[1] 4.650072
Code
 mean(abs(ELA_DF_Test$school_state_diff))
[1] 6.75
Code
  sd(abs(ELA_DF_Test$school_state_diff))
[1] 4.366192
Code
# 
Code
svmCoefs <- as.data.frame(t(coefficients(svm_ela_text)))

svmCoefs

Results

Math

  • Random Forest Regression Model w/ parameters from the state item reports tuned with 5-fold cross-validation explained 8% of the variance in the training set and 5% of the variance in the testing set

  • Random Forest Regression Model including 3 additional text extracted features w/ parameters tuned with 5-fold cross-validation explained 11% of the variance in the training set and 9% of the variance in the testing set

    • The text-extracted feature model_factor played a more significant role than the state-provided variable type in the variable importance plot.

STE

  • Random Forest Regression Model w/ parameters from the state item reports tuned with 5-fold cross-validation explained 55% of the variance in the training set and 29% of the variance in the testing set

  • Random Forest Regression Model including 2 additional text extracted features w/ parameters tuned with 5-fold cross-validation explained 57% of the variance in the training set and 32% of the variance in the testing set

    • The text-extracted feature model_factor played a more significant role than the state-provided variables pts and type in the variable importance plot.

ELA

The ELA models were not able to explain variance in student performance; however, the model that included text-extracted features performed better than the model without those features.

  • Random Forest Regression Model w/ parameters from the state item reports tuned with 5-fold cross-validation explained -2% of the variance in the training set and -17% of the variance in the testing set

  • Random Forest Regression Model including 3 additional text extracted features w/ parameters tuned with 5-fold cross-validation explained 0% of the variance in the training set and -11% of the variance in the testing set

    • The text-extracted features poetry_factor and fig_language_factor played a more significant role than most of the state-provided variables in the variable importance plot. Notably, the variable grade_level had the highest importance. This is interesting, as the ELA standards are the same across all grade levels. In our school there is one teacher for each grade level, so this seems to suggest a broad inconsistency in how the teachers are addressing the standards within their courses.
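For reference, the "% Var explained" values quoted in these bullets come from 1 - MSE/Var(y) (randomForest prints this using var()'s n-1 denominator, so decimals may differ slightly from the version below, but signs agree). A minimal base-R sketch with hypothetical numbers, showing why the value goes negative when a model predicts worse than the mean:

```r
# % variance explained: 1 - MSE / Var(y); goes negative when the model's
# predictions are worse than simply predicting the mean of the observations
pct_var_explained <- function(actual, predicted) {
  mse <- mean((actual - predicted)^2)
  100 * (1 - mse / mean((actual - mean(actual))^2))
}

actual <- c(2, 4, 6, 8)                            # hypothetical outcomes
pct_var_explained(actual, actual)                  # perfect fit: 100
pct_var_explained(actual, rep(mean(actual), 4))    # mean-only baseline: 0
pct_var_explained(actual, c(8, 2, 6, 4))           # worse than the mean: negative
```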

Questions



  1. For each subject, I ran cross-validation to tune the parameters of a Random Forest model that did not include text features and of a Random Forest model that did include text features. Then I compared the % variance explained, the RMSE, and the MAE, as well as differences in the variable importance plots. I am assuming these are the key items to discuss in my “Results” section. Is there anything else I should be adding? I wrote a few bullet points for now that I will flesh out in my final submission.

  2. I tried an SVM regression model; however, because there are so many different standards, I received error messages when my test set contained values of the variable standard that weren’t present in the training set. Is there a way to work around this? This was an issue for the Science and Math tests but not for the ELA test, since the standards are the same across all the grades. I can use a random forest for all of the tests without issue. Is it OK to just stick with the Random Forest regressions?

  3. The model explained much more of the variance in student performance on the Science exam than on the other exams. This was an interesting surprise; is it OK to conjecture in my writing as to why that could be true?

  4. I have tried to comment out many of the lines of code that were “sanity checks” for me and produced tables. When I create tokens, I have been printing the summary of the tokens. This takes up a lot of space. Is this something you want to see in the paper to get a sense of the corpus, or should I comment out these summaries?
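Regarding question 2, one common workaround (sketched below with hypothetical standard codes, not the actual MCAS data) is to force the test column onto the training data's factor levels before predicting, so unseen standards become NA rows that can be filtered out instead of crashing the predict step:

```r
# Re-level a test-set factor column to the training set's levels;
# values never seen in training become NA instead of raising an error
align_levels <- function(test_col, train_col) {
  factor(as.character(test_col), levels = levels(factor(train_col)))
}

train_standard <- factor(c("7.RL.1", "7.RL.2", "7.RI.4"))   # hypothetical codes
test_standard  <- c("7.RL.1", "8.RL.9")                     # "8.RL.9" unseen in training

aligned <- align_levels(test_standard, train_standard)
is.na(aligned)   # FALSE TRUE: the unseen standard is now NA and can be dropped
```

Rows with NA in the aligned column would then be removed (or imputed) before calling predict() on the SVM; whether that loss of test items is acceptable is a judgment call.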